Where should I call sess.run(iterator.initializer)? - python

I want to try a code structure like this:
train_input_fn_try(): generates the data and returns a feature dict and a label; the data comes from tf.data.Dataset.from_tensor_slices((tf.random_uniform([100,5]), tf.random_uniform([100], maxval=4, dtype=tf.int32))).
In the main function: I call tf.estimator.DNNClassifier to get a classifier, then call classifier.train(input_fn=lambda: train_input_fn_try(batch_size=3), steps=6) to train.
But I found that I must call sess.run(iterator.initializer) before iterator.get_next().
I don't know where sess.run(iterator.initializer) should be called without destroying this code structure: in the main function or in train_input_fn_try? And how?
Here is an example of code that doesn't work:
def train_input_fn_try(batch_size=2, epoch=1, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random_uniform([100, 5]),
         tf.random_uniform([100], maxval=4, dtype=tf.int32)))
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.repeat(epoch)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_initializable_iterator()
    with tf.Session() as sess:
        sess.run(iterator.initializer)
    text, label = iterator.get_next()
    return {"text": text}, label

with tf.Session() as sess:
    my_feature_columns = []
    my_feature_columns.append(tf.feature_column.numeric_column(key="text", shape=[5]))
    clf = tf.estimator.DNNClassifier(feature_columns=my_feature_columns,
                                     hidden_units=[10, 10], n_classes=4)
    clf.train(input_fn=lambda: train_input_fn_try(batch_size=3), steps=6)
The runtime error is:
FailedPreconditionError (see above for traceback): GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,5], [?]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: dnn/head/assert_range/assert_less/Assert/Assert/_106 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_83_dnn/head/assert_range/assert_less/Assert/Assert", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Setting aside @Lescurel's very on-point comment, the problem you have is that you initialize the iterator in a different session from the one in which you then train:
def train_input_fn_try(batch_size=2, epoch=1, shuffle=True):
    # [...]
    with tf.Session() as sess:  # <<<<<<<<<<<<<<<<<<<<<<< This session...
        sess.run(iterator.initializer)
    text, label = iterator.get_next()
    return {"text": text}, label

with tf.Session() as sess:  # <<<<<<<<<<<<<<<<<<<<<<<<<<< ...is NOT the same as this one!
    # [...]
The with tf.Session() as sess: statement creates a new session instance, assigns it to sess, and that session is closed once you exit the with statement.
For your code, the best solution is to just use a one-shot iterator, but if you really want to use the initializable one, pass sess as a parameter to train_input_fn_try and remove the with statement inside the function:
def train_input_fn_try(sess, batch_size=2, epoch=1, shuffle=True):
    # [...]
    sess.run(iterator.initializer)
    # [...]
Update: why this still doesn't work (with Estimators)
The way the Estimator framework works is approximately:
Make a new graph
Call input_fn to set up the input pipeline in the new graph
Call model_fn to set up the model in the new graph
Make a Session and start a training loop
When you make the lambda, the sess you pass is not the one that will be used by the estimator, so this won't work for you I'm afraid. I am not aware at the moment of ways to use other types of iterators with Estimators; you might have to stick to one-shot iterators.
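Along those lines, here is a minimal sketch of such an input_fn, assuming TF 1.x (1.5 or later), where input_fn simply returns the tf.data.Dataset itself and the Estimator creates and initializes the iterator inside its own graph and session, so no sess.run(...) is needed anywhere in your code:

import tensorflow as tf

def train_input_fn_try(batch_size=2, epoch=1, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random_uniform([100, 5]),
         tf.random_uniform([100], maxval=4, dtype=tf.int32)))
    # Wrap the features in the dict expected by the "text" feature column.
    dataset = dataset.map(lambda text, label: ({"text": text}, label))
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.repeat(epoch).batch(batch_size)
    return dataset  # no iterator and no session handling needed here

my_feature_columns = [tf.feature_column.numeric_column(key="text", shape=[5])]
clf = tf.estimator.DNNClassifier(feature_columns=my_feature_columns,
                                 hidden_units=[10, 10], n_classes=4)
clf.train(input_fn=lambda: train_input_fn_try(batch_size=3), steps=6)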

Related

Memory leak for custom tensorflow training using @tf.function

I am trying to write my own training loop for TF2/Keras, following the official Keras walkthrough. The vanilla version works like a charm, but when I try to add the @tf.function decorator to my training step, a memory leak grabs all my memory and I lose control of my machine. Does anyone know what is going on?
The important parts of the code look like this:
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = siamese_network(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, siamese_network.trainable_weights)
    optimizer.apply_gradients(zip(grads, siamese_network.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

@tf.function
def test_step(x, y):
    val_logits = siamese_network(x, training=False)
    val_acc_metric.update_state(y, val_logits)
    val_prec_metric.update_state(y_batch_val, val_logits)
    val_rec_metric.update_state(y_batch_val, val_logits)

for epoch in range(epochs):
    step_time = 0
    epoch_time = time.time()
    print("Start of {} epoch".format(epoch))
    for step, (x_batch_train, y_batch_train) in enumerate(train_ds):
        if step > steps_epoch:
            break
        loss_value = train_step(x_batch_train, y_batch_train)
    train_acc = train_acc_metric.result()
    train_acc_metric.reset_states()

    for val_step, (x_batch_val, y_batch_val) in enumerate(test_ds):
        if val_step > validation_steps:
            break
        test_step(x_batch_val, y_batch_val)
    val_acc = val_acc_metric.result()
    val_prec = val_prec_metric.result()
    val_rec = val_rec_metric.result()
    val_acc_metric.reset_states()
    val_prec_metric.reset_states()
    val_rec_metric.reset_states()
If I comment out the @tf.function lines, the memory leak doesn't occur, but each step is about 3 times slower. My guess is that somehow the graph is being created again within each epoch or something like that, but I have no idea how to solve it.
This is the tutorial I am following: https://keras.io/guides/writing_a_training_loop_from_scratch/
tl;dr;
TensorFlow may be generating a new graph for each unique set of argument values passed into the decorated functions. Make sure you are passing consistently-shaped Tensor objects to test_step and train_step instead of python objects.
Details
This is a stab in the dark. While I've never tried @tf.function, I did find the following warnings in the documentation:
tf.function also treats any pure Python value as opaque objects, and builds a separate graph for each set of Python arguments that it encounters.
and
Caution: Passing python scalars or lists as arguments to tf.function will always build a new graph. To avoid this, pass numeric arguments as Tensors whenever possible
Finally:
A Function determines whether to reuse a traced ConcreteFunction by computing a cache key from an input's args and kwargs. A cache key is a key that identifies a ConcreteFunction based on the input args and kwargs of the Function call, according to the following rules (which may change):
The key generated for a tf.Tensor is its shape and dtype.
The key generated for a tf.Variable is a unique variable id.
The key generated for a Python primitive (like int, float, str) is its value.
The key generated for nested dicts, lists, tuples, namedtuples, and attrs is the flattened tuple of leaf-keys (see nest.flatten). (As a result of this flattening, calling a concrete function with a different nesting structure than the one used during tracing will result in a TypeError.)
For all other Python types the key is unique to the object. This way a function or method is traced independently for each instance it is called with.
What I get from all this is that if you don't pass in a consistently-sized Tensor object to your @tf.function-ified function (perhaps you use Python collections or primitives instead), it is likely that you are creating a new graph version of your function with every distinct argument value you pass in. I'm guessing this could create the memory explosion behavior you're seeing. I can't tell how your test_ds and train_ds objects are being created, but you might want to make sure that they are created such that enumerate(blah_ds) returns tensors like in the tutorial, or at least convert the values to tensors before passing to your test_step and train_step functions.
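Here is a hypothetical minimal reproduction of that retracing behavior (not the asker's code): passing plain Python ints triggers a new trace per value, while Tensors of a fixed shape and dtype reuse a single trace.

import tensorflow as tf

@tf.function
def square(x):
    print("tracing")  # printed only when a new graph is traced
    return x * x

for i in range(3):
    square(i)                           # Python int: traces separately for 0, 1 and 2
for i in range(3):
    square(tf.constant(i, tf.float32))  # scalar Tensor: traced only once

Applied to the question, that would mean making sure everything reaching train_step and test_step is a Tensor of a consistent shape (for example, batching with drop_remainder=True so the last batch does not change shape).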

Tensorflow graph inside class - how to manage sessions and scopes

I am trying to build a generic tensorflow infrastructure wrapped inside a simple one layer NN class (see code below).
I will be creating many NNets so I was wondering what was the best way to manage the sessions and the variables.
Typically, I'd like to get tf.trainable_variables() for only one network, not all of them (in the "show" function) so that I can print the network I want.
I also have to pass the session variable "sess" to every function, so that the variables are not re-initialized.
I think I am not doing everything properly... Can someone help?
class oneLayerNN:
    """
    Implements a 1 hidden-layer neural network: y = W2 * ([W1 * x + b1]+) + b1
    """
    def __init__(self, ...):
        ...
        self.initOp = tf.global_variables_initializer()

    def show(self, sess):
        tvars = tf.trainable_variables()
        tvals = sess.run(tvars)
        for var, val in zip(tvars, tvals):
            print(var.name, val)
        print()

    def initializeVariables(self, sess):
        sess.run(self.initOp)

    def forwardPropagation(self, sess, x):
        labels = sess.run(self.yHat, feed_dict={self.x: x})
        return labels

    def train(self, sess, dataset, epochs, batchSize, debug=False, verbose=False):
        dataset = dataset.batch(batchSize)
        iterator = dataset.make_initializable_iterator()
        next_element = iterator.get_next()
        for epoch in range(epochs):
            sess.run(iterator.initializer)
            while True:
                try:
                    batch_x, batch_y = sess.run(next_element)
                    _, c = sess.run([self.optimizer, self.loss],
                                    feed_dict={self.x: batch_x, self.y: batch_y})
                except tf.errors.OutOfRangeError:
                    break

with tf.Session() as sess:
    network.initializeVariables(sess)
    network.show(sess)
It is probably a matter of taste and of how you intend to use your objects.
If it is OK for you to limit your objects to deal with a single tf.Session (as in Keras — should cover basic needs and probably a bit beyond), then you could simply instantiate a single tf.Session via your preferred Singleton-like pattern (maybe just plain old functions like in Keras).
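For instance, a minimal sketch of such a Singleton-like session holder, assuming TF 1.x (the function name get_session is illustrative, not from the original code):

import tensorflow as tf

_SESSION = None

def get_session():
    """Create the shared session on first use and reuse it afterwards."""
    global _SESSION
    if _SESSION is None:
        _SESSION = tf.Session()
    return _SESSION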
Thanks for your answers.
However, I still have issues with variable scopes. How can I define variables as part of my object? I want to be able to do something like:
vars = network.getTrainableVariables()
And that should return only the variables defined in that object (not like tf.trainable_variables())
I can't find one example of a clean declaration of variables within a scope when using multiple networks at the same time (the scope being the name of the network for example).
At the moment when I run the code multiple times, it creates variables W,b, then W_1,b_1, then W_2,b_2 etc...
Also, I would like network.initialize() to initialize only the variables defined within this graph, not all variables in every network...
A solution would be to declare a network's variables within a scope 'name' and then be able to reset_default_graph within this 'name' scope, but I am not able to do that.
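For reference, a minimal sketch of the per-network scoping described above, assuming TF 1.x (scope and variable names are illustrative); the answer below suggests a higher-level alternative:

import tensorflow as tf

def build_net(name):
    # All variables created inside this scope get names prefixed with `name/`.
    with tf.variable_scope(name):
        w = tf.get_variable("W", shape=[5, 4])
        b = tf.get_variable("b", shape=[4])
    return w, b

build_net("net_a")
build_net("net_b")

# Only the trainable variables of one network:
net_a_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="net_a")

with tf.Session() as sess:
    sess.run(tf.variables_initializer(net_a_vars))  # initialize only net_a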
I'd suggest using tf.keras.Model to manage state. Take a look at the subclassing section of the tf.keras documentation. There are training examples using Model.fit there, but you can also just call the object directly, and it will collect variables and losses for you in properties (variables, trainable_variables, losses, etc.).
Whatever you do, I'd separate the model definition (anything that manages Variable objects) from the training loop. And when defining the model, Variables should be attributes of the model definition object and created once (not necessarily in __init__, but protected by an if self.attribute is None: self.attribute = tf.Variable(...)).
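A minimal sketch of that suggestion, assuming TF 2.x; the class and layer names below are illustrative, not taken from the original code:

import tensorflow as tf

class OneLayerNN(tf.keras.Model):
    def __init__(self, hidden_units, num_outputs):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(hidden_units, activation="relu")
        self.out = tf.keras.layers.Dense(num_outputs)

    def call(self, x):
        return self.out(self.hidden(x))

net_a = OneLayerNN(10, 4)
net_b = OneLayerNN(10, 4)
net_a(tf.zeros([1, 5]))  # calling the model once builds its variables
net_b(tf.zeros([1, 5]))

# Each instance tracks only its own variables, unlike tf.trainable_variables():
print([v.name for v in net_a.trainable_variables])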

Saving model and initialization in Keras

I have created a model in Keras, which I then initialised by calling
session=tf.Session()
session.run(tf.global_variables_initializer())
After training, I tried to save the model by running
saver = tf.train.Saver()
saver.save(session, "action_inference_cart_pole_plan16_5000episode.ckpt")
However, it keeps returning this error
FailedPreconditionError: Attempting to use uninitialized value dense_241/kernel
[[Node: dense_241/kernel/_21554 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1854_dense_241/kernel", _device="/job:localhost/replica:0/task:0/gpu:0"](dense_241/kernel)]]
[[Node: dense_284/bias/_21741 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1947_dense_284/bias", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](^_arg_save_15/Const_0_0, ^save_15/SaveV2/tensor_names, ^save_15/SaveV2/shape_and_slices)]]
I have tried to manually initialize the variables that failed, and that worked once before. However, now there are different variables, and I can't even find them. I would like to understand why this is happening.
Here is the full code
Keras has its own built-in model save and load methods. When training Keras models, you should use them instead of the TF saver, since Keras has its own meta computation graph that should probably be initialized when loading a model.
Here is an example (copied from the keras documentation) for how to save and load a keras model
from keras.models import load_model
model.save('my_model.h5') # creates a HDF5 file 'my_model.h5'
del model # deletes the existing model
# returns a compiled model
# identical to the previous one
model = load_model('my_model.h5')

TensorFlow FailedPreconditionError when using variables from the tf.metric module [duplicate]

This question already has answers here:
FailedPreconditionError: Attempting to use uninitialized in Tensorflow
I tried to add some additional measurements to my training code for a CNN by utilising the functions from the tf.metrics submodule, such as tf.metrics.accuracy(y_labels, y_predicted) and equivalents for precision or recall. This is in contrast to most of the tutorials, which suggest the more convoluted:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Whereas my implementation replaces this line with:
accuracy = tf.metrics.accuracy(y_labels, y_predicted)
Now, even though I do the sess.run(tf.initialize_all_variables()) within my with tf.Session() as sess: block, I still get the following error when trying to use tf.metrics.accuracy function:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value performance/accuracy/count
[[Node: performance/accuracy/count/read = Identity[T=DT_FLOAT, _class=["loc:@performance/accuracy/count"], _device="/job:localhost/replica:0/task:0/cpu:0"](performance/accuracy/count)]]
Most notably, replacing the accuracy = tf.metrics.accuracy(y_labels, y_predicted) line with accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) fixes the problem, however, I would like to implement other metrics such as precision, recall, etc. without doing it by hand.
TL;DR: Add the following line at the beginning of your session:
sess.run(tf.local_variables_initializer())
The confusion arises from the name of the (as frankyjuang points out) deprecated tf.initialize_all_variables() function. This function was deprecated in part because it is misnamed: it doesn't actually initialize all variables, and instead it only initializes global (not local) variables.
According to the documentation for the tf.metrics.accuracy() function (emphasis added):
The accuracy function creates two local variables, total and count that are used to compute the frequency with which predictions matches labels.
Therefore you need to add an explicit initialization step for the local variables, which can be done using tf.local_variables_initializer(), as suggested above.
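A minimal self-contained sketch of the fix, assuming TF 1.x (the label and prediction values are made up):

import tensorflow as tf

y_labels = tf.constant([1, 0, 1, 1])
y_predicted = tf.constant([1, 0, 0, 1])
accuracy, update_op = tf.metrics.accuracy(y_labels, y_predicted)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # model (global) variables
    sess.run(tf.local_variables_initializer())   # the metric's total/count locals
    sess.run(update_op)                          # accumulate one batch
    print(sess.run(accuracy))                    # 0.75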
sess.run(tf.initialize_all_variables()) is deprecated.
Use sess.run(tf.global_variables_initializer()) instead to resolve your issue.
Reference
According to the documentation of tf.initialize_all_variables:
THIS FUNCTION IS DEPRECATED. It will be removed after 2017-03-02. Instructions for updating: Use tf.global_variables_initializer instead.

Tensorflow: Using Adam optimizer

I am experimenting with some simple models in tensorflow, including one that looks very similar to the first MNIST for ML Beginners example, but with a somewhat larger dimensionality. I am able to use the gradient descent optimizer with no problems, getting good enough convergence. When I try to use the ADAM optimizer, I get errors like this:
tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value Variable_21/Adam
[[Node: Adam_2/update_Variable_21/ApplyAdam = ApplyAdam[T=DT_FLOAT, use_locking=false, _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_21, Variable_21/Adam, Variable_21/Adam_1, beta1_power_2, beta2_power_2, Adam_2/learning_rate, Adam_2/beta1, Adam_2/beta2, Adam_2/epsilon, gradients_11/add_10_grad/tuple/control_dependency_1)]]
where the specific variable that complains about being uninitialized changes depending on the run. What does this error mean? And what does it suggest is wrong? It seems to occur regardless of the learning rate I use.
The AdamOptimizer class creates additional variables, called "slots", to hold values for the "m" and "v" accumulators.
See the source here if you're curious, it's actually quite readable:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py#L39 . Other optimizers, such as Momentum and Adagrad use slots too.
These variables must be initialized before you can train a model.
The normal way to initialize variables is to call tf.initialize_all_variables() which adds ops to initialize the variables present in the graph when it is called.
(Aside: despite its name, initialize_all_variables() does not initialize anything itself; it only adds ops that will initialize the variables when run.)
What you must do is call initialize_all_variables() after you have added the optimizer:
...build your model...
# Add the optimizer
train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
# Add the ops to initialize variables. These will include
# the optimizer slots added by AdamOptimizer().
init_op = tf.initialize_all_variables()
# launch the graph in a session
sess = tf.Session()
# Actually initialize the variables
sess.run(init_op)
# now train your model
for ...:
    sess.run(train_op)
FailedPreconditionError: Attempting to use uninitialized value is one of the most frequent errors related to tensorflow. From official documentation, FailedPreconditionError
This exception is most commonly raised when running an operation that
reads a tf.Variable before it has been initialized.
In your case the error even explains what variable was not initialized: Attempting to use uninitialized value Variable_1. One of the TF tutorials explains a lot about variables, their creation/initialization/saving/loading
Basically, to initialize variables you have 3 options:
initialize all global variables with tf.global_variables_initializer()
initialize the variables you care about with tf.variables_initializer(list_of_vars); notice that you can use this function to mimic global_variables_initializer: tf.variables_initializer(tf.global_variables()) (a short sketch of this option follows this answer)
initialize only one variable with var_name.initializer
I almost always use the first approach. Remember that you should run it inside a session, so you will get something like this:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
If you are curious about more information about variables, read this documentation to learn about report_uninitialized_variables and is_variable_initialized.
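A minimal sketch of option 2 above, assuming TF 1.x (variable names are illustrative):

import tensorflow as tf

w = tf.Variable(tf.zeros([5, 4]), name="w")
b = tf.Variable(tf.zeros([4]), name="b")
other = tf.Variable(0, name="other")

init_wb = tf.variables_initializer([w, b])  # only w and b, not `other`

with tf.Session() as sess:
    sess.run(init_wb)
    # `other` is still uninitialized:
    print(sess.run(tf.report_uninitialized_variables()))  # [b'other']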
You need to run tf.global_variables_initializer() in your session, like
init = tf.global_variables_initializer()
sess.run(init)
Full example is available in this great tutorial
https://www.tensorflow.org/get_started/mnist/mechanics
Define and run the init op after creating the AdamOptimizer; don't define it or run it earlier:
sess.run(tf.initialize_all_variables())
or
sess.run(tf.global_variables_initializer())
I was having a similar problem. (No problems training with the GradientDescent optimizer, but an error was raised when using the Adam optimizer, or any other optimizer with its own variables.)
Changing to an interactive session solved this problem for me.
sess = tf.Session()
into
sess = tf.InteractiveSession()
