TensorFlow graph inside a class - how to manage sessions and scopes - python

I am trying to build a generic tensorflow infrastructure wrapped inside a simple one layer NN class (see code below).
I will be creating many NNets so I was wondering what was the best way to manage the sessions and the variables.
Typically, in the "show" function I'd like to get tf.trainable_variables() for only one network, not all of them, so that I can print just the network I want.
I also have to pass the session variable "sess" to every function so that the variables are not re-initialized.
I think I am not doing everything properly... Can someone help?
class oneLayerNN:
    """
    Implements a 1-hidden-layer neural network: y = W2 * ([W1 * x + b1]+) + b2
    """
    def __init__(self, ...):
        ...
        self.initOp = tf.global_variables_initializer()

    def show(self, sess):
        tvars = tf.trainable_variables()
        tvals = sess.run(tvars)
        for var, val in zip(tvars, tvals):
            print(var.name, val)
        print()

    def initializeVariables(self, sess):
        sess.run(self.initOp)

    def forwardPropagation(self, sess, x):
        labels = sess.run(self.yHat, feed_dict={self.x: x})
        return labels

    def train(self, sess, dataset, epochs, batchSize, debug=False, verbose=False):
        dataset = dataset.batch(batchSize)
        iterator = dataset.make_initializable_iterator()
        next_element = iterator.get_next()
        for epoch in range(epochs):
            sess.run(iterator.initializer)
            while True:
                try:
                    batch_x, batch_y = sess.run(next_element)
                    _, c = sess.run([self.optimizer, self.loss],
                                    feed_dict={self.x: batch_x, self.y: batch_y})
                except tf.errors.OutOfRangeError:
                    break

network = oneLayerNN(...)
with tf.Session() as sess:
    network.initializeVariables(sess)
    network.show(sess)

It is probably a matter of taste and of how you intend to use your objects.
If you are OK with limiting your objects to a single tf.Session (as in Keras; this should cover basic needs and probably a bit beyond), then you could simply instantiate one tf.Session via your preferred singleton-like pattern (maybe just plain old functions, as in Keras).
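For instance, a minimal sketch of that singleton approach with plain module-level functions (the function names are illustrative, not an actual Keras API):

import tensorflow as tf

_SESSION = None

def get_session():
    # Lazily create one process-wide session on first use.
    global _SESSION
    if _SESSION is None:
        _SESSION = tf.Session()
    return _SESSION

def clear_session():
    # Close and forget the shared session.
    global _SESSION
    if _SESSION is not None:
        _SESSION.close()
        _SESSION = None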

Thanks for your answers.
However, I still have issues with variable scopes. How can I define variables as part of my object? I want to be able to do something like:
vars = network.getTrainableVariables()
And that should return only the variables defined in that object (not everything, like tf.trainable_variables() does).
I can't find one example of a clean declaration of variables within a scope when using multiple networks at the same time (the scope being the name of the network for example).
At the moment, when I run the code multiple times, it creates variables W, b, then W_1, b_1, then W_2, b_2, etc.
Also, I would like network.initialize() to initialize only the variables defined within this graph, not all variables in every network...
A solution would be to declare each network's variables within a scope 'name' and then be able to reset the default graph within only that 'name' scope, but I am not able to do that.
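For reference, here is a sketch of what this could look like with per-network variable scopes in graph-mode TF 1.x (class and method names mirror the question; shapes are illustrative):

import tensorflow as tf

class OneLayerNN:
    def __init__(self, name):
        self.name = name
        # Every variable of this network lives under its own scope.
        with tf.variable_scope(self.name):
            self.W = tf.get_variable('W', shape=[10, 4])
            self.b = tf.get_variable('b', shape=[4])
        # Initializer op that touches only this network's variables.
        self.initOp = tf.variables_initializer(self.getTrainableVariables())

    def getTrainableVariables(self):
        # Only this network's variables, unlike tf.trainable_variables().
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.name)

    def initialize(self, sess):
        sess.run(self.initOp)

net1 = OneLayerNN('net1')
net2 = OneLayerNN('net2')
with tf.Session() as sess:
    net1.initialize(sess)  # leaves net2 untouched
    print([v.name for v in net1.getTrainableVariables()])  # ['net1/W:0', 'net1/b:0']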

I'd suggest using tf.keras.Model to manage state. Take a look at the subclassing section of the tf.keras documentation. There are training examples using Model.fit there, but you can also just call the object directly, and it will collect variables and losses for you in properties (variables, trainable_variables, losses, etc.).
Whatever you do, I'd separate the model definition (anything that manages Variable objects) from the training loop. And when defining the model, Variables should be attributes of the model definition object and created once (not necessarily in __init__, but protected by an if self.attribute is None: self.attribute = tf.Variable(...)).
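A minimal subclassing sketch along those lines (layer sizes are illustrative):

import tensorflow as tf

class OneLayerNN(tf.keras.Model):
    def __init__(self, hidden_units=30, output_units=1):
        super(OneLayerNN, self).__init__()
        self.hidden = tf.keras.layers.Dense(hidden_units, activation='relu')
        self.out = tf.keras.layers.Dense(output_units)

    def call(self, x):
        return self.out(self.hidden(x))

net = OneLayerNN()
_ = net(tf.zeros([4, 10]))  # variables are created on the first call
# Only this network's variables, with no manual scope bookkeeping:
print([v.name for v in net.trainable_variables])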

Related

optimizer.step() Not updating Model Weights/Parameters

I'm currently working on a solution via PyTorch. I'm not going to share the exact solution but I will provide code that reproduces the issue I'm having.
I have a model defined as follows:
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 4)

    def forward(self, x):
        return nn.functional.relu(self.fc1(x))
Then I create an instance: my_model = Net(). Next I create an Adam optimizer as such:
optim = Adam(my_model.parameters())
# create a random input
inputs = torch.tensor(np.array([1,1,1,1,1,2,2,2,2,2]),dtype=torch.float32,requires_grad=True)
# get the outputs
outputs = my_model(inputs)
# compute gradients / backprop via
outputs.backward(gradient=torch.tensor([1.,1.,1.,5.]))
# store parameters before optimizer step
before_step = list(my_model.parameters())[0].detach().numpy()
# update parameters via
optim.step()
# collect parameters again
after_step = list(my_model.parameters())[0].detach().numpy()
# Print if parameters are the same or not
print(np.array_equal(before_step,after_step)) # Prints True
I provided my model's parameters to the Adam optimizer, so I'm not exactly sure why they aren't updating. I know one usually drives this with a loss function, but I cannot do that in my case; I assumed that if I gave the model's parameters to the optimizer, it would know to connect the two.
Anyone know why the parameters aren't getting updated?
The problem is with detach (docs).
As noted at the bottom:
Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks
So that is exactly what's happening here. To correctly compare the parameters, you need to clone (docs) them to get a real copy.
list(my_model.parameters())[0].clone().detach().numpy()
On a side note, it can be helpful if you check the gradients after optim.step() with print(list(my_model.parameters())[0].grad) to check if the graph is intact. Also, don't forget to call optim.zero_grad().
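A hedged sketch of the corrected comparison, reusing the names from the question:

# Clone before detaching so the snapshot is a real copy, not a view
# of the same storage that optim.step() mutates in place:
before_step = list(my_model.parameters())[0].clone().detach().numpy()
optim.step()
after_step = list(my_model.parameters())[0].clone().detach().numpy()
print(np.array_equal(before_step, after_step))  # now prints False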

Memory leak for custom TensorFlow training using @tf.function

I am trying to write my own training loop for TF2/Keras, following the official Keras walkthrough. The vanilla version works like a charm, but when I add the @tf.function decorator to my training step, a memory leak grabs all my memory and I lose control of my machine. Does anyone know what is going on?
The important parts of the code look like this:
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = siamese_network(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, siamese_network.trainable_weights)
    optimizer.apply_gradients(zip(grads, siamese_network.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

@tf.function
def test_step(x, y):
    val_logits = siamese_network(x, training=False)
    val_acc_metric.update_state(y, val_logits)
    val_prec_metric.update_state(y, val_logits)
    val_rec_metric.update_state(y, val_logits)
for epoch in range(epochs):
    step_time = 0
    epoch_time = time.time()
    print("Start of {} epoch".format(epoch))
    for step, (x_batch_train, y_batch_train) in enumerate(train_ds):
        if step > steps_epoch:
            break
        loss_value = train_step(x_batch_train, y_batch_train)
    train_acc = train_acc_metric.result()
    train_acc_metric.reset_states()

    for val_step, (x_batch_val, y_batch_val) in enumerate(test_ds):
        if val_step > validation_steps:
            break
        test_step(x_batch_val, y_batch_val)
    val_acc = val_acc_metric.result()
    val_prec = val_prec_metric.result()
    val_rec = val_rec_metric.result()
    val_acc_metric.reset_states()
    val_prec_metric.reset_states()
    val_rec_metric.reset_states()
If I comment out the @tf.function lines, the memory leak doesn't occur, but each step is about 3 times slower. My guess is that the graph is somehow being created again within each epoch, or something like that, but I have no idea how to solve it.
This is the tutorial I am following: https://keras.io/guides/writing_a_training_loop_from_scratch/
tl;dr
TensorFlow may be generating a new graph for each unique set of argument values passed into the decorated functions. Make sure you are passing consistently-shaped Tensor objects to test_step and train_step instead of python objects.
Details
This is a stab in the dark. While I've never tried @tf.function, I did find the following warnings in the documentation:
tf.function also treats any pure Python value as opaque objects, and builds a separate graph for each set of Python arguments that it encounters.
and
Caution: Passing python scalars or lists as arguments to tf.function will always build a new graph. To avoid this, pass numeric arguments as Tensors whenever possible
Finally:
A Function determines whether to reuse a traced ConcreteFunction by computing a cache key from an input's args and kwargs. A cache key is a key that identifies a ConcreteFunction based on the input args and kwargs of the Function call, according to the following rules (which may change):
- The key generated for a tf.Tensor is its shape and dtype.
- The key generated for a tf.Variable is a unique variable id.
- The key generated for a Python primitive (like int, float, str) is its value.
- The key generated for nested dicts, lists, tuples, namedtuples, and attrs is the flattened tuple of leaf-keys (see nest.flatten). (As a result of this flattening, calling a concrete function with a different nesting structure than the one used during tracing will result in a TypeError.)
- For all other Python types the key is unique to the object. This way a function or method is traced independently for each instance it is called with.
What I get from all this is that if you don't pass in a consistently-sized Tensor object to your @tf.function-ified function (perhaps you use Python collections or primitives instead), it is likely that you are creating a new graph version of your function with every distinct argument value you pass in. I'm guessing this could create the memory explosion behavior you're seeing. I can't tell how your test_ds and train_ds objects are being created, but you might want to make sure that they are created such that enumerate(blah_ds) returns tensors like in the tutorial, or at least convert the values to tensors before passing to your test_step and train_step functions.
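If the shapes genuinely vary, one common mitigation is to pin the traced signature with input_signature so that tf.function reuses one graph for all batch sizes. A sketch, with illustrative shapes and dtypes that you would adjust to your data:

@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 128], dtype=tf.float32),  # x: batch of features (shape assumed)
    tf.TensorSpec(shape=[None], dtype=tf.int32),         # y: batch of labels (dtype assumed)
])
def train_step(x, y):
    ...  # body as above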

Can't restore tensorflow variables

I have a class as follows, and its load function returns the saved TensorFlow graph.
class StoredGraph():
    ...
    def build_meta_saver(self, meta_file=None):
        meta_file = self._get_latest_checkpoint() + '.meta' if not meta_file else meta_file
        meta_saver = tf.train.import_meta_graph(meta_file)
        return meta_saver

    def load(self, sess, saverObj):
        saverObj.restore(sess, self._get_latest_checkpoint())
        graph = tf.get_default_graph()
        return graph
I have another class; let's call it TrainNet().
class TrainNet():
    ...
    def train(self, dataset):
        self.train_graph = tf.Graph()
        meta_saver, saver = None, None
        GraphIO = StoredGraph(experiment_dir)
        latest_checkpoint = GraphIO._get_latest_checkpoint()
        with self.train_graph.as_default():
            tf.set_random_seed(42)
            if not latest_checkpoint:
                # build graph here
                self.build_graph()
            else:
                meta_saver = GraphIO.build_meta_saver()  # this loads the meta file
            with tf.Session(graph=self.train_graph) as sess:
                if not meta_saver:
                    sess.run(tf.global_variables_initializer())
                if latest_checkpoint:
                    self.train_graph = GraphIO.load(sess, meta_saver)
                # here access placeholders using self.train_graph.get_tensor_by_name()
                # and feed the values
In my training class I use the above class simply by loading the graph via the load function: self.train_graph = StoredGraphclass.load(sess, metasaver).
My doubt is: are all the variables restored by loading the saved graph? Normally the restore operation, saver.restore(), is defined in the same script, and it restores all the variables of the graph. But I am calling saver.restore() in a different class and using the returned graph to access placeholders.
I suspect that this way not all the variables are restored. Is the above approach wrong? This doubt arose when I checked the values of the weights in two different .meta files written at different training steps: the values were exactly the same, meaning either this variable wasn't updated or the restoration method is faulty.
As long as you have created all the necessary variables in your file and given them the same "name" (and of course the shape needs to be correct as well), restore will load all the appropriate values into the appropriate variables. Here you can find a toy example showing you how this can be done.
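A toy example along those lines (the checkpoint path is illustrative):

import tensorflow as tf

# Save: a variable named 'w' in one graph...
with tf.Graph().as_default():
    w = tf.get_variable('w', initializer=42.0)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, '/tmp/toy-model')

# ...is restored into a fresh graph, as long as the name and shape match.
with tf.Graph().as_default():
    w2 = tf.get_variable('w', shape=[])
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, '/tmp/toy-model')
        print(sess.run(w2))  # 42.0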

How to apply Optimizer on Variable in Chainer?

Here is an example in PyTorch:
optimizer = optim.Adam([modifier_var], lr=0.0005)
And here in Tensorflow:
self.train = self.optimizer.minimize(self.loss, var_list=[self.modifier])
But Chainer's optimizers can only be used on a Link. How can I apply an Optimizer to a Variable in Chainer?
In short, there is no way to directly assign a chainer.Variable (or even a chainer.Parameter) to a chainer.Optimizer.
The following is some redundant explanation.
First, I re-define Variable and Parameter to avoid confusion.
Variable is (1) torch.Tensor in PyTorch v0.4, (2) torch.autograd.Variable in PyTorch v0.3, and (3) chainer.Variable in Chainer v4.
Variable is an object which holds two tensors: .data and .grad. This is the necessary and sufficient condition, so a Variable is not necessarily a learnable parameter (the target of an optimizer).
In both libraries, there is another class, Parameter, which is similar to but not the same as Variable. Parameter is torch.nn.Parameter in PyTorch and chainer.Parameter in Chainer.
Parameter must be a learnable parameter and should be optimized.
Therefore, there should be no case in which you register a Variable (not a Parameter) with an Optimizer (although PyTorch allows registering a Variable with an Optimizer: this is just for backward compatibility).
Second, in PyTorch torch.optim.Optimizer directly optimizes Parameter, but in Chainer chainer.Optimizer DOES NOT optimize Parameter: instead, chainer.UpdateRule does. The Optimizer just registers UpdateRules with the Parameters in a Link.
Therefore, it is only natural that chainer.Optimizer does not receive Parameter as its arguments, because it is just a "delivery-man" of UpdateRule.
If you want to attach different UpdateRule for each Parameter, you should directly create an instance of UpdateRule subclass, and attach it to the Parameter.
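A hedged sketch of that per-parameter control, assuming the MyChain model from the example below (hyperparam.alpha is Adam's learning rate in Chainer):

# After optimizer.setup(model), every Parameter carries its own UpdateRule,
# which you can tweak (or replace) individually:
for param in model.l1.params():
    param.update_rule.hyperparam.alpha = 1e-2  # custom learning rate for l1 only
    # or freeze this parameter entirely:
    # param.update_rule.enabled = False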
Below is an example that learns a regression task with a MyChain MLP model using the Adam optimizer in Chainer.
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import Chain

# Prepare your model (neural network) as a `Link` or `Chain`
class MyChain(Chain):
    def __init__(self):
        super(MyChain, self).__init__(
            l1=L.Linear(None, 30),
            l2=L.Linear(None, 30),
            l3=L.Linear(None, 1)
        )

    def __call__(self, x):
        h = self.l1(x)
        h = self.l2(F.sigmoid(h))
        return self.l3(F.sigmoid(h))

model = MyChain()

# Then you can instantiate an optimizer
optimizer = chainer.optimizers.Adam()
# Register the model with the optimizer (to indicate which parameters to update)
optimizer.setup(model)

# Calculate the loss, and update the parameters as follows.
def lossfun(x, y):
    loss = F.mean_squared_error(model(x), y)
    return loss

# This iteration is the "training" that fits the model to the desired function.
# (x and y are the training data, e.g. numpy arrays.)
for i in range(300):
    optimizer.update(lossfun, x, y)
So, in summary: you set up the model with the optimizer, after which you can use the update function to calculate the loss and update the model's parameters.
The above code comes from here
Also, there are other ways to write training code using the Trainer module. For a more detailed tutorial on Chainer, please refer to the links below:
chainer-handson
deep-learning-tutorial-with-chainer

Restore subset of variables in Tensorflow

I am training a Generative Adversarial Network (GAN) in TensorFlow, where we basically have two different networks, each with its own optimizer.
self.G, self.layer = self.generator(self.inputCT,batch_size_tf)
self.D, self.D_logits = self.discriminator(self.GT_1hot)
...
self.g_optim = tf.train.MomentumOptimizer(self.learning_rate_tensor, 0.9).minimize(self.g_loss, global_step=self.global_step)
self.d_optim = tf.train.AdamOptimizer(self.learning_rate, beta1=0.5) \
.minimize(self.d_loss, var_list=self.d_vars)
The problem is that I train one of the networks (g) first, and then, I want to train g and d together. However, when I call the load function:
self.sess.run(tf.initialize_all_variables())
self.sess.graph.finalize()
self.load(self.checkpoint_dir)

def load(self, checkpoint_dir):
    print(" [*] Reading checkpoints...")
    ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
    if ckpt and ckpt.model_checkpoint_path:
        ckpt_name = os.path.basename(ckpt.model_checkpoint_path)
        self.saver.restore(self.sess, ckpt.model_checkpoint_path)
        return True
    else:
        return False
I have an error like this (with a lot more traceback):
Tensor name "beta2_power" not found in checkpoint files checkpoint/MR2CT.model-96000
I can restore the g network and keep training with that function, but when I want to start d from scratch, and g from the stored model, I get that error.
To restore a subset of variables, you must create a new tf.train.Saver and pass it a specific list of variables to restore in the optional var_list argument.
By default, a tf.train.Saver will create ops that (i) save every variable in your graph when you call saver.save() and (ii) lookup (by name) every variable in the given checkpoint when you call saver.restore(). While this works for most common scenarios, you have to provide more information to work with specific subsets of the variables:
If you only want to restore a subset of the variables, you can get a list of these variables by calling tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=G_NETWORK_PREFIX), assuming that you put the "g" network in a common with tf.name_scope(G_NETWORK_PREFIX): or tf.variable_scope(G_NETWORK_PREFIX): block. You can then pass this list to the tf.train.Saver constructor.
If you want to restore a subset of the variables and/or the variables in the checkpoint have different names, you can pass a dictionary as the var_list argument. By default, each variable in a checkpoint is associated with a key, which is the value of its tf.Variable.name property. If the name is different in the target graph (e.g. because you added a scope prefix), you can specify a dictionary that maps string keys (in the checkpoint file) to tf.Variable objects (in the target graph).
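A short sketch of both options (the scope name 'g' and the prefix are illustrative):

# Option 1: collect by scope, assuming the generator was built
# inside a `with tf.variable_scope('g'):` block.
g_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='g')
g_saver = tf.train.Saver(var_list=g_vars)
g_saver.restore(sess, ckpt.model_checkpoint_path)

# Option 2: map checkpoint names to variables in the current graph,
# e.g. when the target graph added a scope prefix.
name_map = {v.op.name.replace('new_prefix/', ''): v for v in g_vars}
tf.train.Saver(var_list=name_map).restore(sess, ckpt.model_checkpoint_path)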
I had a similar problem when restoring only part of my variables from a checkpoint, where some of the saved variables did not exist in the new model.
Inspired by @Lidong's answer, I modified the reading function a little:
from tensorflow.python import pywrap_tensorflow

def get_tensors_in_checkpoint_file(file_name, all_tensors=True, tensor_name=None):
    varlist = []
    var_value = []
    reader = pywrap_tensorflow.NewCheckpointReader(file_name)
    if all_tensors:
        var_to_shape_map = reader.get_variable_to_shape_map()
        for key in sorted(var_to_shape_map):
            varlist.append(key)
            var_value.append(reader.get_tensor(key))
    else:
        varlist.append(tensor_name)
        var_value.append(reader.get_tensor(tensor_name))
    return (varlist, var_value)
and added a loading function:
def build_tensors_in_checkpoint_file(loaded_tensors):
    full_var_list = list()
    # Loop over all loaded tensor names
    for i, tensor_name in enumerate(loaded_tensors[0]):
        # Look the tensor up in the current graph
        try:
            tensor_aux = tf.get_default_graph().get_tensor_by_name(tensor_name + ":0")
        except KeyError:
            print('Not found: ' + tensor_name)
            continue  # skip saved variables that do not exist in the new model
        full_var_list.append(tensor_aux)
    return full_var_list
Then you can simply load all common variables using:
CHECKPOINT_NAME = '/path/to/checkpoint'  # placeholder: the path of your save file
restored_vars = get_tensors_in_checkpoint_file(file_name=CHECKPOINT_NAME)
tensors_to_load = build_tensors_in_checkpoint_file(restored_vars)
loader = tf.train.Saver(tensors_to_load)
loader.restore(sess, CHECKPOINT_NAME)
Edit: I am using tensorflow 1.2
Inspired by @mrry, I propose a solution for this problem.
To make it clear, I formulate the problem as restoring a subset of variables from a checkpoint, when the model is built on top of a pre-trained model.
First, we should use the print_tensors_in_checkpoint_file function from the inspect_checkpoint library, or just simply extract this function:
from tensorflow.python import pywrap_tensorflow

def print_tensors_in_checkpoint_file(file_name, tensor_name, all_tensors):
    varlist = []
    reader = pywrap_tensorflow.NewCheckpointReader(file_name)
    if all_tensors:
        var_to_shape_map = reader.get_variable_to_shape_map()
        for key in sorted(var_to_shape_map):
            varlist.append(key)
    return varlist

ckpt_path = ...  # the path of the ckpt file
varlist = print_tensors_in_checkpoint_file(file_name=ckpt_path, all_tensors=True, tensor_name=None)
Then we use tf.get_collection(), just as @mrry said:
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
Finally, we can initialize the saver by:
saver = tf.train.Saver(variables[:len(varlist)])
The complete version can be found at my github: https://github.com/pobingwanghai/tensorflow_trick/blob/master/restore_from_checkpoint.py
In my situation, the new variables are added at the end of the model, so I can simply use [:len(varlist)] to identify the needed variables; for a more complex situation, you might have to do some hand-alignment work or write a simple string-matching function to determine the required variables.
You can create a separate instance of tf.train.Saver() with the var_list argument set to the variables you want to restore, and create another instance to save the variables.
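A minimal sketch of that arrangement (the scope name and checkpoint path are illustrative):

# One saver restores only the generator's variables from the checkpoint...
g_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='generator')
restore_saver = tf.train.Saver(var_list=g_vars)
# ...while a second saver saves everything (g plus the fresh d).
save_saver = tf.train.Saver()

sess.run(tf.global_variables_initializer())
restore_saver.restore(sess, '/tmp/gan-model')  # overwrite g's fresh values
# ... train g and d together ...
save_saver.save(sess, '/tmp/gan-model')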
