I'm having a lot of trouble understanding the proper use of tf.train.Saver.
I have a session where I create several distinct and separate network models. All models are trained and I save the best performing networks for later use.
However, when I try to restore a model at a later time I get an error which seems to indicate that some variables are either not getting saved or restored:
NotFoundError: Tensor name "Network_8/train/beta2_power" not found in checkpoint files networks/network_0.ckpt
For some reason, when I try to load the variables for Network_0, I'm being told I need variable information for Network_8.
What is the best way to make sure I only save/restore the correct variables from a multi-network session?
It seems part of my problem is that, while I have created a dict of the variables I want to save (weights and biases) for each network, when I set up an optimizer such as AdamOptimizer, TensorFlow automatically creates extra variables which need to be initialized. This is fine if you use tf.train.Saver to save ALL variables and you only have one network; however, I am training multiple networks and only saving the best results. I'm not sure how to add the variables TensorFlow creates automatically to my dict for saving.
My solution is to create a part_saver that uses the same tensor names in both the original model and the new model (i.e. Network_0 and Network_8) and restores only the needed variables.
part_saver = tf.train.Saver({"W":w,"b":b,...})
Initialize all the variables in Network_8 before restoring the partial model.
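One way to build the list of variables for such a partial saver is to filter variable names by network scope. The sketch below is plain Python working on name strings only. The names, the helper model_variable_names and the "train" sub-scope are assumptions taken from the error message above, not from the asker's code. In TensorFlow 1.x the real names could be collected with something like [v.op.name for v in tf.global_variables()]. Optimizer slot variables may follow a different naming pattern, so it is worth printing the real names first.

```python
def model_variable_names(all_names, scope, exclude=("train",)):
    """Return names under `scope`, skipping sub-scopes listed in `exclude`."""
    # The trailing slash keeps "Network_8" from also matching "Network_80".
    prefix = scope.rstrip("/") + "/"
    selected = []
    for name in all_names:
        if not name.startswith(prefix):
            continue
        # Sub-scopes are every path component except the final variable name.
        parts = name[len(prefix):].split("/")
        if any(part in exclude for part in parts[:-1]):
            continue
        selected.append(name)
    return selected


# Hypothetical variable names for illustration only.
names = [
    "Network_0/W", "Network_0/b", "Network_0/train/beta1_power",
    "Network_8/W", "Network_8/b", "Network_8/train/beta2_power",
    "Network_80/W",
]
print(model_variable_names(names, "Network_8"))  # ['Network_8/W', 'Network_8/b']
```

The selected names can then be mapped to their variable objects and passed to tf.train.Saver as a dict, so that optimizer state from other networks is never requested from the checkpoint.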
I'm trying to modify a program that uses the Estimator class in TensorFlow (v1.10) and I would like to access the evaluation metric results every time evaluation occurs so that I can copy the checkpoint files only when a new maximum has been achieved.
One idea I had was to create a class inheriting from SessionRunHook, doing the work I want in the after_run method. According to the documentation I can specify what is passed to after_run using before_run. However I cannot find a way to access the evaluation metrics results I want from the information passed in to before_run.
I looked into the Estimator code and it appears that it is writing the results to a summary file so another idea I had was to read this back in the after_run method, but the summary api doesn't seem to provide any read operations.
Are there any other ways I can achieve what I want to do? Not using the Estimator class is not an option as that would involve drastic changes to the code I'm working with.
Checkpoints are not the same as exporting. Checkpoints are about fault-recovery and involve saving the complete training state (weights, global step number, etc.).
In your case I would recommend exporting. The exported model will be written to a directory called “exporter” and the serving input function specifies what the end-user will be expected to provide to the prediction service.
You can use the BestExporter class to export only the models that are performing best:
https://www.tensorflow.org/api_docs/python/tf/estimator/BestExporter
This class exports the serving graph and checkpoints of the best models.
It performs a model export every time the new model is better than any existing model.
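If the narrower goal from the question is enough (copying checkpoint files only when a new maximum is reached), the bookkeeping itself does not depend on TensorFlow. Below is a minimal sketch, assuming you can obtain the metric value and the checkpoint path prefix after each evaluation, for example from the dict returned by Estimator.evaluate. The class name BestCheckpointCopier and its methods are hypothetical, not part of any TensorFlow API.

```python
import glob
import os
import shutil


class BestCheckpointCopier:
    """Copies checkpoint files to `dest_dir` only when the metric improves."""

    def __init__(self, dest_dir):
        self.dest_dir = dest_dir
        self.best = None  # best metric value seen so far

    def update(self, metric_value, checkpoint_prefix):
        """Return True if the checkpoint was copied (new maximum)."""
        if self.best is not None and metric_value <= self.best:
            return False
        self.best = metric_value
        os.makedirs(self.dest_dir, exist_ok=True)
        # A checkpoint consists of several files sharing one prefix,
        # e.g. model.ckpt-1000.index and model.ckpt-1000.data-00000-of-00001.
        for path in glob.glob(glob.escape(checkpoint_prefix) + ".*"):
            shutil.copy(path, self.dest_dir)
        return True
```

After each evaluation you would call something like copier.update(metrics["accuracy"], checkpoint_prefix), where the metric key and the prefix come from your own code.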
I'm trying to load three different models in the same process. Only the first one works as expected; the others return seemingly random results.
Basically the order is as follows:
define and compile first model
load the previously trained weights
rename layers
the same process for the second model
the same process for the third model
So, something like:
model1 = Model(inputs=Input(shape=input_size_im) , outputs=layers_firstmodel)
model1.compile(optimizer='sgd', loss='mse')
model1.load_weights(weights_first, by_name=True)
# rename layers but didn't work
model2 = Model(inputs=Input(shape=input_size_im) , outputs=layers_secondmodel)
model2.compile(optimizer='sgd', loss='mse')
model2.load_weights(weights_second, by_name=True)
# rename layers but didn't work
model3 = Model(inputs=Input(shape=input_size_im) , outputs=layers_thirdmodel)
model3.compile(optimizer='sgd', loss='mse')
model3.load_weights(weights_third, by_name=True)
# rename layers but didn't work
for im in list_images:
    results_firstmodel = model1.predict(im)
    results_secondmodel = model2.predict(im)
    results_thirdmodel = model3.predict(im)
I'd like to perform inference over a bunch of images. The idea is to loop over the images, run inference with these three models, and return the results.
I have tried to rename all layers to make them unique, with no success. I also created a different graph for each network and ran inference with a different session for each. This works but it's very inefficient (in addition, I have to set their weights every time because sess.run(tf.global_variables_initializer()) removes them). Each time a session is created, TensorFlow prints "creating tensorflow device (/device:GPU:0)".
I am running Tensorflow 1.4.0-rc0, Keras 2.1.1 and Ubuntu 16.04 kernel 4.14.
The OP is correct here. There is a serious bug when you try to load multiple weight files in the same script. The above answer doesn't solve this. If you actually inspect the weights when loading weights for multiple models in the same script, you will notice that they differ from the weights you get when you load a single model on its own. This is where the randomness the OP observes is coming from.
EDIT: To solve this problem, encapsulate the model.load_weights call within a function, and the randomness that you are experiencing should go away. The problem is that something goes wrong when you have multiple load_weights calls in the same script, as you have above. If you load those model weights inside a function, your issues should go away.
From the Keras docs we have this explanation for the use of load_weights:
loads the weights of the model from a HDF5 file (created by save_weights). By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.
Therefore, if your architecture is unchanged you should drop by_name=True or set it to False (its default value). This could be causing the inconsistencies that you are facing, as your weights are probably not being loaded because your layers have different names.
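To see why renamed layers can end up with untouched (randomly initialized) weights, here is a toy model of name-based loading in plain Python. It is a simplified illustration of the behavior described in the docs quote, not Keras's actual implementation. The function load_by_name and the layer names are hypothetical.

```python
def load_by_name(model_layers, saved_weights):
    """Toy stand-in for by_name=True: copy weights only where names match."""
    loaded = []
    for name in model_layers:
        if name in saved_weights:
            model_layers[name] = saved_weights[name]
            loaded.append(name)
    # Layers without a matching name are skipped and keep their old values.
    return loaded


saved = {"conv1": [0.1, 0.2], "dense1": [0.3]}
same_names = {"conv1": None, "dense1": None}
renamed = {"m2_conv1": None, "m2_dense1": None}

print(load_by_name(same_names, saved))  # ['conv1', 'dense1']
print(load_by_name(renamed, saved))     # []
```

In the second call nothing is loaded and no error is raised, which would match the symptom of a model that runs but returns random-looking results.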
Another important thing to consider is the nature of your HDF5 file, and the way you created it. If it indeed contains only the weights (created with save_weights as the docs point out) then there should be no problem in proceeding as explained before.
Now, if that HDF5 contains weights and architecture in the same file, then you should be loading it with keras.models.load_model instead (further reading if you like here). If this is the case then this would also explain those inconsistencies.
As a side suggestion, I prefer to save my models using callbacks such as ModelCheckpoint, or EarlyStopping if you want to automatically determine when to stop training. This not only gives you greater flexibility when training and saving your models (as you can stop at the optimal training epoch or whenever you like), but also makes loading those models easy, as you can simply use the load_model method to load both architecture and weights into your desired variable.
Finally, here is one useful SO post where saving (and loading) Keras models is explained.
According to the documentation and numerous SO posts regarding this API, the saver object must be created using
saver = tf.train.Saver(...variables...)
I wanted to know if there is any way to automatically populate the (...variables...) without having to explicitly list all variables and ops used in my network.
Right now my network is only two layers so it is not a huge hassle, but it feels downright stone-age like to have to list all the variables manually.
Calling the tf.train.Saver constructor with no arguments creates an instance that saves/restores all saveable objects in your graph, which typically includes all of your model variables. Therefore you should be able to write:
saver = tf.train.Saver()
…and get the desired effect without too much trouble.
TensorFlow allows us to save/load a model's structure using tf.train.write_graph, so that we can restore it in the future to continue our training session. However, I'm wondering whether this is necessary, because I can create a module, e.g. GraphDefinition.py, and use this module to re-create the model.
So, which is the better way to save the model structure, or is there a rule of thumb that suggests which way I should use when saving a model?
First of all, you have to understand that a TensorFlow graph does not contain the current weights (unless you save them there manually), so if you load the model structure from graph.pb, you will start training from the very beginning. If you want to continue training or use your trained model, you have to save a checkpoint (using tf.train.Saver) containing the values of the variables, not only the structure.
Check out this thread: Tensorflow: How to restore a previously saved model (python)
I know that there are countless questions on stack and github, etc. on how to restore a trained model in Tensorflow. I have read most of them (1,2,3).
I have almost exactly the same problem as 3. However, I would like, if possible, to solve it in a different fashion: my training and my test need to be in separate scripts called from the shell, and I do not want to repeat in the test script the exact lines I used to define the graph, so I cannot use TensorFlow FLAGS or the other answers based on rerunning the graph by hand.
I also do not want to sess.run every variable and manually map them by hand as was explained (using import_graph_def with the input_map argument), as my graph is quite big.
So I build some graph and train it in a specific script. For instance (but without the training part):
#Script 1
import tensorflow as tf
import cPickle as pickle
x=tf.Variable(42)
saver=tf.train.Saver()
sess=tf.Session()
#Saving the graph
graph_def=sess.graph_def
with open('graph.pkl','wb') as output:
    pickle.dump(graph_def,output,pickle.HIGHEST_PROTOCOL)
#Training the model
sess.run(tf.initialize_all_variables())
#Saving the variables
saver.save(sess,"pretrained_model.ckpt")
I now have both graph and variables saved so I should be able to run my test model from another script even if I have extra training nodes in my graph.
#Script 2
import tensorflow as tf
import cPickle as pickle
sess=tf.Session()
with open('graph.pkl','rb') as input:
    graph_def=pickle.load(input)
tf.import_graph_def(graph_def,name='persisted')
Then obviously I want to restore the variables using a saver, but I encounter the same problem as 3: no variables are found, so I cannot even create a saver. So I cannot write:
saver=tf.train.Saver()
saver.restore(sess,"pretrained_model.ckpt")
Is there a way to bypass those limitations? I thought that importing the graph would recover the uninitialized variables in every node, but it seems not. Do I really need to rebuild it a second time, as in most of the answers given?
The list of variables is saved in a Collection which is not saved in the GraphDef. Saver by default uses the list in ops.GraphKeys.VARIABLES collection (accessible through tf.all_variables()), and if you restored from GraphDef rather than using Python API to build your model, that collection is empty. You could specify the list of variables manually in tf.train.Saver(var_list=['MyVariable1:0', 'MyVariable2:0',...]).
Alternatively, instead of GraphDef you could use MetaGraphDef, which saves collections; there's a recently added MetaGraphDef HowTo.
To my knowledge, and according to my tests, you can't simply pass names to a tf.train.Saver object. It must be either a list of variables or a dictionary.
I would also like to read the model from graph_def and then load the variables using a saver; however, attempting it only results in the error message: "Variable to save is not a variable"