So, I'm using a bunch of functions from OpenAI baselines for Reinforcement Learning. In those functions, policy nets are initialised using statements like:
with tf.variable_scope('deepq', reuse=True):
    ...
    return output
The problem is that the pointer to the output of those networks gets returned while still inside the scope, which means that when accessing those functions from another .py file I am still inside those scopes.
Basically I want to run a first function train_policy(output_dir) that trains the net and dumps the checkpoint to disk using tf.train.Saver().
Next, I run a function run_policy(output_dir) that rebuilds the same tf.Graph and loads its pretrained values from the checkpoint dir.
Right now, when I try this, I get a ValueError:
"Variable deepq/... already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?" because at the point of running the second function, I'm still in the scope defined by the first.. I checked the code from OpenAI baselines (very nested code, hard to see everything that's going on), and reuse is already set to True.
So I tried doing something like:
tf.get_default_session().close() followed by:
tf.reset_default_graph()
after the first function call. (I don't need the session to remain active since I'm dumping everything to disk)
But this gives me errors because I'm still inside a nested graph scope and so I can't reset the default graph (see e.g. here).
Alternatively I tried things like:
tf.get_default_graph().as_graph_def().__exit__()
or
tf.name_scope('deepq').__exit__()
but the __exit__() function needs a whole bunch of args I don't know how to get (and I can't find good documentation on how to use it).
My current solution is to run these functions in separate subprocesses in Python (and let the garbage collector do all the work), but this doesn't feel like a satisfactory solution.
Any ideas on how to deal with this? Ideally I'd need something like: tf.clear_all_graphs_and_sessions()
Alright, one solution is indeed to reset the default graph:
I simply wrap every function call in a new default graph object like this:
with tf.Graph().as_default():
    train_policy(output_dir)

with tf.Graph().as_default():
    run_policy(output_dir)

...
This way the default graph simply gets reinitialised empty and you can load whatever is in the checkpoint file. (Inside every function I also close the default session before returning).
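For completeness, the restore side can look something like this (a sketch: the single variable stands in for the real 'deepq' net, and it assumes train_policy saved a checkpoint with tf.train.Saver under output_dir):

import tensorflow as tf

def run_policy(output_dir):
    # Rebuild the same net in the current (fresh) default graph;
    # a single variable stands in for the real 'deepq' network here.
    with tf.variable_scope('deepq'):
        w = tf.get_variable('w', shape=[4, 2])
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, tf.train.latest_checkpoint(output_dir))
        return sess.run(w)  # trained values, loaded from disk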
You can try to do your work in another default graph:
with tf.get_default_graph().as_default():
    with tf.variable_scope('deepq', reuse=False):
        v = tf.get_variable('v', shape=[])
        print(v.name, v.graph)

with tf.Graph().as_default():
    v = tf.get_variable('v', shape=[])
    print(v.name, v.graph)
Output:
deepq/v:0 <tensorflow.python.framework.ops.Graph object at 0x7f61adaa6390>
v:0 <tensorflow.python.framework.ops.Graph object at 0x7f61460abbd0>
Related
I am trying to introduce a mod/mixin for a problem. In particular I am focusing here on a SpeechRecognitionProblem. I intend to modify this problem and therefore I seek to do the following:
class SpeechRecognitionProblemMod(speech_recognition.SpeechRecognitionProblem):

    def hparams(self, defaults, model_hparams):
        SpeechRecognitionProblem.hparams(self, defaults, model_hparams)
        vocab_size = self.feature_encoders(model_hparams.data_dir)['targets'].vocab_size
        p = defaults
        p.vocab_size['targets'] = vocab_size

    def feature_encoders(self, data_dir):
        # ...
So this one does not do much. It calls the hparams() function from the base class and then changes some values.
Now, there are already some ready-to-go problems, e.g. Librispeech:
@registry.register_problem()
class Librispeech(speech_recognition.SpeechRecognitionProblem):
    # ..
However, in order to apply my modifications I am doing this:
@registry.register_problem()
class LibrispeechMod(SpeechRecognitionProblemMod, Librispeech):
    # ..
This should, if I am not mistaken, override everything (with identical signatures) in Librispeech and instead call the methods of SpeechRecognitionProblemMod.
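A quick check of the method resolution order with stand-in classes (hypothetical minimal names, not the real ones) supports this:

class Base: pass
class Mod(Base): pass
class Libri(Base): pass
class LibriMod(Mod, Libri): pass

# Methods are looked up left to right, so Mod's implementations win:
print([c.__name__ for c in LibriMod.__mro__])
# ['LibriMod', 'Mod', 'Libri', 'Base', 'object']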
Since I was able to train a model with this code I am assuming that it's working as intended so far.
Now here comes my problem:
After training I want to serialize the model. This usually works. However, it does not with my mod and I actually know why:
At a certain point hparams() gets called. Debugging to that point will show me the following:
self                   # {LibrispeechMod}
self.hparams           # <bound method SpeechRecognitionProblem.hparams of ..>
self.feature_encoders  # <bound method SpeechRecognitionProblemMod.feature_encoders of ..>
self.hparams should be <bound method SpeechRecognitionProblemMod.hparams of ..>! It would seem that for some reason hparams() of SpeechRecognitionProblem gets called directly instead of SpeechRecognitionProblemMod's. But please note that it's the correct type for feature_encoders()!
The thing is that I know this is working during training. I can see that the hyper-parameters (hparams) are applied accordingly, simply because the model's graph node names change through my modifications.
There is one specialty I need to point out: tensor2tensor allows you to dynamically load a t2t_usr_dir, i.e. additional Python modules which get loaded by import_usr_dir. I make use of that function in my serialization script as well:
if usr_dir:
    logging.info('Loading user dir %s' % usr_dir)
    import_usr_dir(usr_dir)
This is the only culprit I can see at the moment, although I cannot tell why it would cause the problem.
If anybody sees something I don't, I'd be glad to get a hint about what I'm doing wrong here.
So what is the error you're getting?
For the sake of completeness, this is the result of the wrong hparams() method being called:
NotFoundError (see above for traceback): Restoring from checkpoint failed.
Key transformer/symbol_modality_256_256/softmax/weights_0 not found in checkpoint
symbol_modality_256_256 is wrong. It should be symbol_modality_<vocab-size>_256 where <vocab-size> is a vocabulary size which gets set in SpeechRecognitionProblemMod.hparams.
So, this weird behavior came from the fact that I was remote debugging and the source files of the usr_dir were not correctly synchronized. Everything works as intended; the source files were simply not matching.
Case closed.
Is it possible to use tf.TensorArrays for reading and writing inside the body of a tf.while_loop, but not pass them all through loop_vars?
I want to use tf.while_loop as part of a graph for WaveNet sound generation, a sequential generation mechanism that generates the next amplitude value based on a window of previously generated ones. But I want to use this only during inference, so there is no need for gradients, and I would call it with back_prop=False.
Besides this, the loop body function must read and write intermediate values that must be remembered across time steps.
It looks like tf.TensorArray is the only option for reading and writing values in this way, but I notice that tf.TensorArray.write() returns a new tf.TensorArray that is meant to be returned by the body and used in the loop_vars argument. Is this the best way to do this?
If I don't have a need for gradients, is there a simpler way to preserve state over the loop?
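For concreteness, the loop_vars pattern I'm referring to looks something like this (a minimal sketch, not my actual WaveNet code):

import tensorflow as tf

ta = tf.TensorArray(tf.float32, size=5)

def body(i, ta):
    # write() returns a *new* TensorArray that has to be threaded back
    # through loop_vars for the write to be visible in later iterations.
    ta = ta.write(i, tf.cast(i, tf.float32) * 2.0)
    return i + 1, ta

_, ta_final = tf.while_loop(lambda i, ta: i < 5, body,
                            [tf.constant(0), ta], back_prop=False)

with tf.Session() as sess:
    print(sess.run(ta_final.stack()))  # [0. 2. 4. 6. 8.]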
You can use tf.assign inside the tf.while_loop body to write into a global variable.
EDIT: You cannot use tf.assign to assign to sliced indices of a tf.Tensor; it is, however, allowed for a tf.Variable. The arguments passed to the while_loop body are of type tf.Tensor and not tf.Variable, so assigning to them won't work.
Here is some sample code.
import tensorflow as tf

x = tf.Variable([0, 0])
assign_op = tf.assign(x[1], 42)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(x))  # [0, 0]
    sess.run(assign_op)
    print(sess.run(x))  # [0, 42]
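And here is a sketch of the same idea inside a tf.while_loop (my assumption: a fixed-size state variable; parallel_iterations=1 and the control dependency make sure the writes actually happen in order):

import tensorflow as tf

state = tf.Variable(tf.zeros([5]), trainable=False)

def body(i):
    # Write one slice of the global variable per step instead of
    # threading the state through loop_vars.
    write = tf.assign(state[i], tf.cast(i, tf.float32) + 1.0)
    with tf.control_dependencies([write]):
        return i + 1

loop = tf.while_loop(lambda i: i < 5, body, [tf.constant(0)],
                     back_prop=False, parallel_iterations=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(loop)
    print(sess.run(state))  # [1. 2. 3. 4. 5.]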
I am beginning to use TensorFlow for some simple Q-learning, but have run into trouble when trying to use variable scopes with layers constructed using tf.layers and tf.contrib.layers. In a nutshell, I want to apply the same layers to different input tensors (for example, to hold the current and next Q values). Here is a minimal example using tf.layers:
import tensorflow as tf
inp1 = tf.placeholder(tf.float64, (4,1))
inp2 = tf.placeholder(tf.float64, (4,1))
def process(inp):
    with tf.variable_scope("foo", reuse=True):
        return tf.layers.dense(inp, 12, name="bar", reuse=True)
process(inp1)
process(inp2)
Trying to execute this code gives the following exception:
ValueError: Variable foo/bar/kernel does not exist, or was not created with
tf.get_variable(). Did you mean to set reuse=None in VarScope?
I understand that setting reuse=True in tf.layers.dense() makes it try to find an already defined layer, which it may fail to do. But if I change the call into tf.layers.dense(inp, 12, name="bar"), then it fails with the same exception.
If I set reuse=None in tf.variable_scope(), then the latter version fails during the call of process(inp2) with the exception:
ValueError: Variable foo/bar/kernel already exists, disallowed.
Did you mean to set reuse=True in VarScope?
Unfortunately, similar errors occur when using tf.contrib.layers.
My question is: Is there a way to make tf.layers work with variable scopes? I know that I could define the weights and biases separately, but it would be nice to retain the abstraction given by tf.layers. Thanks a lot!
My setup is TensorFlow 1.3.0 (CPU) running with Python 3.6.1 on Windows 10 (installed through pip on 64-bit Anaconda 4.4.0).
P.S. I found the use of variable scopes for layers on page 17 of this presentation.
The two errors are different: the first one happens in process(inp1), where TensorFlow tries to find existing variables but there are none; the second happens in process(inp2), where a variable with the same name already exists but the code tries to create a new one, which is disallowed.
I guess you want to reuse those variables for Q-learning, so the solution is quite simple: don't set reuse the first time you define the variables; afterwards you can set reuse=True.
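Applied to the code in the question, that could look like this (a sketch):

import tensorflow as tf

inp1 = tf.placeholder(tf.float64, (4, 1))
inp2 = tf.placeholder(tf.float64, (4, 1))

def process(inp, reuse):
    # First call creates foo/bar/kernel and foo/bar/bias,
    # later calls reuse them.
    with tf.variable_scope("foo", reuse=reuse):
        return tf.layers.dense(inp, 12, name="bar")

out1 = process(inp1, reuse=False)
out2 = process(inp2, reuse=True)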
In the presentation you linked, I guess the variables had already been defined before.
This guide will help you understand more.
I'm using the high-level tf.contrib.learn.Experiment object to interleave training and evaluation. However, I'm facing an issue with the local variables from the evaluation and metrics modules, which are reported as uninitialized:
Variables not initialized: mean/total, mean/count, eval_step
I provide a custom local_init_op to tf.train.Scaffold which basically looks like this:
scaffold = tf.train.Scaffold(
    local_init_op=tf.group(
        iterator.initializer,
        tf.tables_initializer(),
        tf.local_variables_initializer()))
(where iterator is a tf.contrib.data.Iterator.)
which is then stored in a tf.estimator.EstimatorSpec to be returned by the tf.estimator.Estimator's model_fn function.
Since I don't think tf.local_variables_initializer() operates lazily, these variables are presumably not yet created at the time my custom local_init_op is built.
So how to initialize them?
The only solution I found is to not use a custom local_init_op but to rely on the default one, which is built in Scaffold.finalize once all variables have been created.
To initialize my iterator I simply added it in the TABLE_INITIALIZERS collection:
tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
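Inside the model_fn this amounts to something like the following (a sketch; iterator, loss and train_op stand for whatever your model_fn already builds):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # ... build iterator, model, loss and train_op as before ...
    # Register the iterator so that the *default* Scaffold picks it up
    # when it builds local_init_op in finalize(), i.e. after the metric
    # variables (mean/total, mean/count, eval_step) have been created.
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, train_op=train_op)  # no custom scaffold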
I'm trying to train an LSTM in Tensorflow using minibatches, but after training is complete I would like to use the model by submitting one example at a time to it. I can set up the graph within Tensorflow to train my LSTM network, but I can't use the trained result afterward in the way I want.
The setup code looks something like this:
# Build the LSTM model.
cellRaw = rnn_cell.BasicLSTMCell(LAYER_SIZE)
cellRaw = rnn_cell.MultiRNNCell([cellRaw] * NUM_LAYERS)
cell = rnn_cell.DropoutWrapper(cellRaw, output_keep_prob=0.25)

input_data = tf.placeholder(dtype=tf.float32, shape=[SEQ_LENGTH, None, 3])
target_data = tf.placeholder(dtype=tf.float32, shape=[SEQ_LENGTH, None])
input_list = tf.unpack(input_data)  # one tensor per time step

initial_state = cell.zero_state(batch_size=BATCH_SIZE, dtype=tf.float32)

with tf.variable_scope('rnnlm'):
    output_w = tf.get_variable("output_w", [LAYER_SIZE, 6])
    output_b = tf.get_variable("output_b", [6])

outputs, final_state = seq2seq.rnn_decoder(input_list, initial_state, cell, loop_function=None, scope='rnnlm')
output = tf.reshape(tf.concat(1, outputs), [-1, LAYER_SIZE])
output = tf.nn.xw_plus_b(output, output_w, output_b)
Note the two placeholders, input_data and target_data. I haven't bothered including the optimizer setup. After training is complete and the training session is closed, I would like to set up a new session that uses the trained LSTM network, whose input is provided by a completely different placeholder, something like:
with tf.Session() as sess:
    with tf.variable_scope("simulation", reuse=None):
        cellSim = cellRaw
        input_data_sim = tf.placeholder(dtype=tf.float32, shape=[1, 1, 3])
        initial_state_sim = cell.zero_state(batch_size=1, dtype=tf.float32)
        input_list_sim = tf.unpack(input_data_sim)
        outputsSim, final_state_sim = seq2seq.rnn_decoder(input_list_sim, initial_state_sim, cellSim, loop_function=None, scope='rnnlm')
        outputSim = tf.reshape(tf.concat(1, outputsSim), [-1, LAYER_SIZE])

        with tf.variable_scope('rnnlm'):
            output_w = tf.get_variable("output_w", [LAYER_SIZE, nOut])
            output_b = tf.get_variable("output_b", [nOut])

        outputSim = tf.nn.xw_plus_b(outputSim, output_w, output_b)
This second part returns the following error:
tensorflow.python.framework.errors.InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float
[[Node: Placeholder = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Presumably because the graph I'm using still has the old training placeholders attached to the trained LSTM nodes. What's the right way to 'extract' the trained LSTM and put it into a new, different graph that has a different style of inputs? The variable scoping features that Tensorflow has seem to address something like this, but the examples in the documentation all talk about using variable scope as a way of managing variable names so that the same piece of code will generate similar subgraphs within the same graph. The 'reuse' feature seems to be close to what I want, but I don't find the Tensorflow documentation to be clear at all on what it does. The cells themselves cannot be given a name (in other words,
cellRaw = rnn_cell.MultiRNNCell([cellRaw] * NUM_LAYERS, name="multicell")
is not valid), and while I can give a name to a seq2seq.rnn_decoder(), I presumably wouldn't be able to remove the rnn_cell.DropoutWrapper() if I used that node unchanged.
Questions:
What is the proper way to move trained LSTM weights from one graph to another?
Is it correct to say that starting a new session "releases resources", but doesn't erase the graph built in memory?
It seems to me like the 'reuse' feature allows Tensorflow to search outside of the current variable scope for variables with the same name (existing in a different scope), and use them in the current scope. Is this correct? If it is, what happens to all of the graph edges from the non-current scope that link to that variable? If it isn't, why does Tensorflow throw an error if you try to have the same variable name within two different scopes? It seems perfectly reasonable to define two variables with identical names in two different scopes, e.g. conv1/sum1 and conv2/sum1.
In my code I'm working within a new scope but the graph won't run without data to be fed into a placeholder from the initial, default scope. Is the default scope always 'in-scope' for some reason?
If graph edges can span different scopes, and names in different scopes can't be shared unless they refer to the exact same node, then that would seem to defeat the purpose of having different scopes in the first place. What am I misunderstanding here?
Thanks!
What is the proper way to move trained LSTM weights from one graph to another?
You can create your decoding graph first (with a saver object to save the parameters) and create a GraphDef object that you can import into your bigger training graph:
basegraph = tf.Graph()
with basegraph.as_default():
    # ... your graph ...

traingraph = tf.Graph()
with traingraph.as_default():
    tf.import_graph_def(basegraph.as_graph_def())
    # ... your training graph ...
Make sure you load your variables when you start a session for a new graph. I don't have experience with this functionality, so you may have to look into it a bit more.
Is it correct to say that starting a new session "releases resources", but doesn't erase the graph built in memory?
Yep, the graph object still holds it.
It seems to me like the 'reuse' feature allows Tensorflow to search outside of the current variable scope for variables with the same name (existing in a different scope), and use them in the current scope. Is this correct? If it is, what happens to all of the graph edges from the non-current scope that link to that variable? If it isn't, why does Tensorflow throw an error if you try to have the same variable name within two different scopes? It seems perfectly reasonable to define two variables with identical names in two different scopes, e.g. conv1/sum1 and conv2/sum1.
No, reuse determines the behaviour when you use get_variable on an existing name: when it is true it will return the existing variable, otherwise it will create a new one. Normally tensorflow should not throw an error. Are you sure you're using tf.get_variable and not just tf.Variable?
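A minimal illustration:

import tensorflow as tf

with tf.variable_scope("scope"):
    v1 = tf.get_variable("v", shape=[1])   # creates scope/v
with tf.variable_scope("scope", reuse=True):
    v2 = tf.get_variable("v", shape=[1])   # returns the existing scope/v
print(v1 is v2)  # True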
In my code I'm working within a new scope but the graph won't run without data to be fed into a placeholder from the initial, default scope. Is the default scope always 'in-scope' for some reason?
I don't really see what you mean. Placeholders do not always have to be fed: if a placeholder is not required for computing the operation you run, you don't have to feed it.
If graph edges can span different scopes, and names in different scopes can't be shared unless they refer to the exact same node, then that would seem to defeat the purpose of having different scopes in the first place. What am I misunderstanding here?
I think your understanding or usage of scopes is flawed; see above.