So I'm looking at the official TensorFlow tutorial here: https://www.tensorflow.org/tutorials/layers
Basically, it teaches you how to train a classifier for the MNIST dataset.
The complete code is pretty short and can be found here:
https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/examples/tutorials/layers/cnn_mnist.py
I can run it without any problems. But I want to know how I can reuse the estimator trained in this file for some other program. The model is saved as 3 files: one .data-00000-of-00001 file, one .meta file and one .index file.
I googled and it seems you can load the model by:
sess=tf.Session()
saver = tf.train.import_meta_graph('my_model.meta')
saver.restore(sess,tf.train.latest_checkpoint('./'))
But how do I proceed from here? It seems weird the tutorial does not teach you how to reuse the estimator.
Importing a meta graph does not explicitly expose its nodes in your code. It only loads the nodes 'internally', and these internal nodes keep the names you assigned before, like
weights1 = tf.Variable(tf.truncated_normal([28, 128], stddev=1.0 / math.sqrt(float(28))), name="weights1")
In this case "weights1" is your 'internal' node name.
Then you should draw out (assign) your 'internal' node to an 'external' variable at the code level.
# bring out a "weights1" node from meta graph
fc_weights1 = tf.get_collection('weights1')
In the same way, you can do
logits = tf.get_collection('logits_node_before_you_named')
and then rebuild the loss and accuracy nodes you had before on top of it.
In summary, you can pull the nodes you want to use out to the surface (code level) by using the tf.get_collection method.
P.S. The terms 'internal' and 'external' are not official; they are used here for convenience only.
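Putting this together, here is a minimal sketch of the idea. It assumes the original training script added the relevant tensors to collections (e.g. with tf.add_to_collection('weights1', ...) and tf.add_to_collection('logits', ...)); the collection names are illustrative, not taken from the tutorial.
import tensorflow as tf

sess = tf.Session()
saver = tf.train.import_meta_graph('my_model.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))

# tf.get_collection returns a list, so take the first element
fc_weights1 = tf.get_collection('weights1')[0]
logits = tf.get_collection('logits')[0]

# new ops can now be built on top of the restored tensors, e.g. predictions
predictions = tf.argmax(logits, axis=1)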
For a project I'm working on I have to compare the performance of a small, simulated dataset to the performance of well-known datasets (e.g. ImageNet, CIFAR-10) using two neural networks, ResNet v2 at different depths and Inception v3.
To account for the imbalance between my set and the big popular ones, one of the methods I want to employ is augmenting the big sets using either a class from my simulated set or the same class but instead with real-life images. Then, using a pre-trained model, I want to train on both augmented sets and later compare performance.
The issue
The augmentation works fine: I copy over the .tfrecord files containing only images of the extra class to the original dataset and I modify the labels.txt file to contain the appropriate extra labels. However, the issue occurs when trying to train on this augmented dataset:
I0125 16:40:31.279634 140094914221888 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [258] rhs shape= [257]
which is to be expected, as the graph cannot handle the extra class.
My attempted solution
To alleviate this issue I tried applying the same method as was proposed in this answer, using the Python script below (modified for brevity):
checkpoint = {}
saver = tf.compat.v1.train.import_meta_graph(f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}.meta")
with tf.Session() as sess:
    saver.restore(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
    # get appropriate nodes where number of classes is one of its dimensions
    if "inception_v3" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    elif "resnet_v2" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    # select the nodes previously defined
    nodes = [node for node in tf.global_variables() if node.name in target_nodes]
    for node in nodes:
        # load the values of those nodes from the checkpoint
        checkpoint[node.name] = node.eval()
        # extend dimensions for extra class
        checkpoint.update({node.name: np.insert(checkpoint[node.name], checkpoint[node.name].shape[-1], 0.0, axis=-1)})
        # assign new nodes to the checkpoint
        sess.run(tf.assign(node, checkpoint[node.name], validate_shape=False))
    saver.save(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
In short, this script loads the checkpoint, isolates all nodes where one of the dimensions is equal to the number of classes, and increases that dimension by one. Using the checkpoints from the different datasets the models have been trained on, I have verified that these dimensions are always equal to the number of classes. In fact, those tensors seem to be the only thing that differs between the checkpoints.
Although this method executes correctly and allows me to run the training, it feels very boneheaded and I'm almost positive it is not what I'm supposed to do. However, I'm curious what exactly is wrong with it (apart from initializing the new values as 0.0), because except for a wild fluctuation in loss at the start of the training it seems to work as I had hoped.
Other solutions
When looking further into this issue I also found other users' answers on similar questions claiming that it is in fact impossible to modify a checkpoint in the way that I want, and recommending transfer learning or modifying the output layer instead. I am very new to neural network training and I don't know whom to believe, as the linked answer seems to suggest that what I'm trying to do should work.
My question
So I ask: which approach is correct? Could my initial approach work, or should I try a different method to fix this issue? Starting from scratch would not be ideal, as the progress so far has taken over a month of continuous training.
I am using a modified version of the TensorFlow Model Garden as my environment to train the networks, with the modifications pertaining to creating custom datasets and training on them with the supplied training scripts.
I built my own sklearn-like estimator using PyTorch, training on the GPU (CUDA), and it works fine with RandomizedSearchCV when n_jobs==1. When n_jobs > 1, I get the following error:
PicklingError: Can't pickle <class '__main__.LSTM'>: attribute lookup LSTM on __main__ failed
This is the piece of code giving me the error:
model = my_model(input_size=1, hidden_layer_size=80, n_lstm_units=3, bidirectional=False,
                 output_size=1, training_batch_size=60, epochs=7500, device=device)
model.to(device)

hidden_layer_size = random.uniform(40, 200, 20).astype("int")
n_lstm_units = arange(1, 4)

parametros = {'hidden_layer_size': hidden_layer_size, 'n_lstm_units': n_lstm_units}

splitter = ShuffleSplit()
regressor = model
cv_search = \
    RandomizedSearchCV(estimator=regressor, cv=splitter,
                       search_spaces=parametros,
                       refit=True,
                       n_iter=4,
                       verbose=1,
                       n_jobs=2,
                       scoring=make_scorer(mean_squared_error,
                                           greater_is_better=False,
                                           needs_proba=False))
cv_search = MetaSKLearnWrapper(cv_search)
cv_search.fit(X, y)
Using the Neuraxle wrapper leads to exactly the same error; it changes nothing.
I found the closest solution here, but I still don't know how to use RandomizedSearchCV within Neuraxle. It is a brand new project, so I couldn't find an answer in their docs or community examples. If anyone can give me an example or a good pointer, it will save my life. Thank you.
Ps: Any way to run RandomizedSearchCV with my PyTorch model on the GPU without Neuraxle also helps, I just need n_jobs > 1.
Ps2: My model has a fit() method that creates and moves tensors to the GPU; it already works and has been tested.
There are multiple criteria that must be respected here for your code to work:
You need to use Neuraxle's RandomSearch instead of sklearn's random search for this to work. Use Neuraxle's base classes when possible.
Make sure that you use a Neuraxle BaseStep for your PyTorch model, instead of an sklearn base class.
Also, you should create your PyTorch code only in the setup() method or later: you can't create a PyTorch model in the __init__ of a BaseStep that contains PyTorch code. You will want to read this page.
You will probably have to create a Saver for your BaseStep that contains PyTorch code if you want to serialize and then load your trained pipeline again. You can see how we created our TensorFlow Saver for our TensorFlow BaseStep and do something similar. Your saver will probably be much simpler than ours due to the more eager nature of PyTorch. For instance, you could have self.model inside your extension of the BaseStep class. The role of the saver would be to save and strip away this simple variable from self, and to be able to reload it whenever needed.
To sum up: you'd need to create two classes, and your two classes should look very similar to our two TensorFlow step and saver classes here, except that your PyTorch model would be in a self.model variable of your step; a rough sketch of the step class is given below.
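As a rough, untested sketch of the step class only (the class name, hyperparameters, and the training/inference bodies are placeholders, and Neuraxle's exact method signatures may differ between versions, so check the docs):
import torch
from neuraxle.base import BaseStep

class PyTorchLSTMStep(BaseStep):
    def __init__(self, hidden_layer_size=80, n_lstm_units=3):
        super().__init__()
        # only store plain hyperparameters here, no torch objects,
        # so the step itself stays picklable for parallel search
        self.hidden_layer_size = hidden_layer_size
        self.n_lstm_units = n_lstm_units
        self.model = None

    def setup(self):
        # the PyTorch model is created lazily, after forking/unpickling
        if self.model is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
            self.model = torch.nn.LSTM(
                input_size=1,
                hidden_size=self.hidden_layer_size,
                num_layers=self.n_lstm_units,
            ).to(device)
        return self

    def fit(self, data_inputs, expected_outputs=None):
        self.setup()
        # ... training loop over data_inputs / expected_outputs goes here ...
        return self

    def transform(self, data_inputs):
        # ... run inference with self.model and return the predictions ...
        return data_inputs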
I'd be glad to see your implementation of your PyTorch base step and of your PyTorch saver!
You could then even use the AutoML class (see the AutoML example here) to save experiments in a Hyperparameter Repository, as shown in that example.
I'm trying to load three different models in the same process. Only the first one works as expected; the rest of them return seemingly random results.
Basically the order is as follows:
define and compile first model
load previously trained weights
rename layers
the same process for the second model
the same process for the third model
So, something like:
model1 = Model(inputs=Input(shape=input_size_im), outputs=layers_firstmodel)
model1.compile(optimizer='sgd', loss='mse')
model1.load_weights(weights_first, by_name=True)
# rename layers but didn't work
model2 = Model(inputs=Input(shape=input_size_im), outputs=layers_secondmodel)
model2.compile(optimizer='sgd', loss='mse')
model2.load_weights(weights_second, by_name=True)
# rename layers but didn't work
model3 = Model(inputs=Input(shape=input_size_im), outputs=layers_thirdmodel)
model3.compile(optimizer='sgd', loss='mse')
model3.load_weights(weights_third, by_name=True)
# rename layers but didn't work

for im in list_images:
    results_firstmodel = model1.predict(im)
    results_secondmodel = model2.predict(im)
    results_thirdmodel = model3.predict(im)
I'd like to perform inference over a bunch of images. The idea is to loop over the images, run inference with these three models, and return the results.
I have tried renaming all layers to make them unique, with no success. I also created a different graph for each network and ran the inference in a separate session per graph. This works but it's very inefficient (in addition, I have to set the weights every time because sess.run(tf.global_variables_initializer()) removes them). Each time a session is created, TensorFlow prints "creating tensorflow device (/device:GPU:0)".
I am running Tensorflow 1.4.0-rc0, Keras 2.1.1 and Ubuntu 16.04 kernel 4.14.
The OP is correct here. There is a serious bug when you try to load multiple weight files in the same script, and the above answer doesn't solve it. If you actually interrogate the weights when loading weights for multiple models in the same script, you will notice that they differ from the weights you get when you load them for a single model on its own. This is where the randomness the OP observes is coming from.
EDIT: To solve this problem, encapsulate the model.load_weights call within a function, and the randomness you are experiencing should go away. The problem is that something goes wrong when you have multiple load_weights calls at the top level of the same script, as above. If you load those model weights inside a function, your issues should go away, as sketched below.
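A minimal sketch of that workaround, assuming you already have functions that build and return each (uncompiled) architecture; build_first_model and friends are hypothetical helpers, and weights_first etc. are the paths from the question:
def load_model_with_weights(build_fn, weights_path):
    model = build_fn()                        # build the architecture
    model.compile(optimizer='sgd', loss='mse')
    model.load_weights(weights_path)          # load weights inside the function's scope
    return model

model1 = load_model_with_weights(build_first_model, weights_first)
model2 = load_model_with_weights(build_second_model, weights_second)
model3 = load_model_with_weights(build_third_model, weights_third)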
From the Keras docs we have this explanation for the use of load_weights:
loads the weights of the model from a HDF5 file (created by save_weights). By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.
Therefore, if your architecture is unchanged you should drop by_name=True or set it to False (its default value). This could be causing the inconsistencies you are facing, as your weights may not be loaded at all if your layers have different names.
Another important thing to consider is the nature of your HDF5 file, and the way you created it. If it indeed contains only the weights (created with save_weights as the docs point out) then there should be no problem in proceeding as explained before.
Now, if that HDF5 contains weights and architecture in the same file, then you should be loading it with keras.models.load_model instead (further reading if you like here). If this is the case then this would also explain those inconsistencies.
As a side suggestion, I prefer to save my models using callbacks, like ModelCheckpoint or EarlyStopping (if you want to automatically determine when to stop training). This not only gives you greater flexibility when training and saving your models (as you can stop at the optimal training epoch or whenever you desire), but also makes loading those models easier, as you can simply use the load_model method to load both architecture and weights into your desired variable; a brief illustration follows below.
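For example, a small sketch of that callback-based workflow (the file name, monitored metric, and x_train/y_train are arbitrary placeholders, not from the question):
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model

callbacks = [
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
    EarlyStopping(monitor='val_loss', patience=5),
]
model.fit(x_train, y_train, validation_split=0.1, epochs=100, callbacks=callbacks)

# later, architecture and weights can be restored together in one call
restored_model = load_model('best_model.h5')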
Finally, here is one useful SO post where saving (and loading) Keras models is explained.
I have successfully exported a re-trained InceptionV3 NN as a TensorFlow meta graph. I have read this protobuf back into Python successfully, but I am struggling to see a way to export each layer's weight and bias values, which I am assuming are stored within the meta graph protobuf, in order to recreate the NN outside of TensorFlow.
My workflow is as such:
Retrain final layer for new categories
Export meta graph tf.train.export_meta_graph(filename='model.meta')
Build python pb2.py using Protoc and meta_graph.proto
Load Protobuf:
import meta_graph_pb2
saved = meta_graph_pb2.CollectionDef()
with open('model.meta', 'rb') as f:
    saved.ParseFromString(f.read())
From here I can view most aspects of the graph, like node names and such, but I think my inexperience is making it difficult to track down the correct way to access the weight and bias values for each relevant layer.
The MetaGraphDef proto doesn't actually contain the values of the weights and biases. Instead it provides a way to associate a GraphDef with the weights stored in one or more checkpoint files, written by a tf.train.Saver. The MetaGraphDef tutorial has more details, but the approximate structure is as follows:
In your training program, write out a checkpoint using a tf.train.Saver. This will also write a MetaGraphDef to a .meta file in the same directory.
saver = tf.train.Saver(...)
# ...
saver.save(sess, "model")
You should find files called model.meta and model-NNNN (for some integer NNNN) in your checkpoint directory.
In another program, you can import the MetaGraphDef you just created, and restore from a checkpoint.
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.meta")
    saver.restore(sess, "model-NNNN")  # Or whatever checkpoint filename was written.
If you want to get the value of each variable, you can (for example) find the variable in tf.all_variables() collection and pass it to sess.run() to get its value. For example, to print the values of all variables, you can do the following:
for var in tf.all_variables():
    print var.name, sess.run(var)
You could also filter tf.all_variables() to find the particular weights and biases that you're trying to extract from the model.
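For instance, a rough sketch of that filtering (the substrings 'weights' and 'biases' are just examples; the actual variable names depend on how the layers were defined):
weights_and_biases = {
    var.name: sess.run(var)
    for var in tf.all_variables()
    if 'weights' in var.name or 'biases' in var.name
}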
I know that there are countless questions on stack and github, etc. on how to restore a trained model in Tensorflow. I have read most of them (1,2,3).
I have almost exactly the same problem as 3; however, I would like, if possible, to solve it in a different fashion: my training and my test need to be in separate scripts called from the shell, and I do not want to repeat in the test script the exact lines I used to define the graph, so I cannot use TensorFlow FLAGS or the other answers based on rebuilding the graph by hand.
I also do not want to sess.run every variable and manually map them by hand, as was explained there, since my graph is quite big (using import_graph_def with the input_map argument).
So I build a graph and train it in a specific script, for instance (but without the training part):
# Script 1
import tensorflow as tf
import cPickle as pickle

x = tf.Variable(42)
saver = tf.train.Saver()
sess = tf.Session()

# Saving the graph
graph_def = sess.graph_def
with open('graph.pkl', 'wb') as output:
    pickle.dump(graph_def, output, pickle.HIGHEST_PROTOCOL)

# Training the model
sess.run(tf.initialize_all_variables())

# Saving the variables
saver.save(sess, "pretrained_model.ckpt")
I now have both graph and variables saved so I should be able to run my test model from another script even if I have extra training nodes in my graph.
#Script 2
import tensorflow as tf
import cPickle as pickle
sess=tf.Session()
with open('graph.pkl', 'rb') as input:
    graph_def = pickle.load(input)
tf.import_graph_def(graph_def, name='persisted')
Then, obviously, I want to restore the variables using a saver, but I encounter the same problem as in 3: no variables are found, so I cannot even create a saver. Hence I cannot write:
saver=tf.train.Saver()
saver.restore(sess,"pretrained_model.ckpt")
Is there a way to bypass those limitations? I thought that importing the graph would also recover the uninitialized variables in every node, but it seems not. Do I really need to redefine the graph a second time, as most of the answers suggest?
The list of variables is saved in a collection, which is not saved in the GraphDef. The Saver by default uses the list in the ops.GraphKeys.VARIABLES collection (accessible through tf.all_variables()), and if you restored from a GraphDef rather than using the Python API to build your model, that collection is empty. You could specify the list of variables manually in tf.train.Saver(var_list=['MyVariable1:0', 'MyVariable2:0', ...]).
Alternatively, instead of GraphDef you could use MetaGraphDef, which does save collections; there's a recently added MetaGraphDef HowTo. A small sketch of that route follows below.
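A small sketch of the MetaGraphDef route, assuming a TF version recent enough to include import_meta_graph; file names follow the question's example:
# script 1: saving the checkpoint with a Saver also writes a MetaGraphDef
# (pretrained_model.ckpt.meta) alongside it
saver = tf.train.Saver()
saver.save(sess, "pretrained_model.ckpt")

# script 2: importing the meta graph restores the saved collections (including
# the variables list), so the returned Saver can restore the checkpoint directly
sess = tf.Session()
new_saver = tf.train.import_meta_graph("pretrained_model.ckpt.meta")
new_saver.restore(sess, "pretrained_model.ckpt")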
To my knowledge and in my tests, you can't simply pass names to a tf.train.Saver object. It must be either a list of variables or a dictionary.
I would also like to read the model from graph_def and then load the variables using a saver; however, attempting it results only in the error message: "Variable to save is not a Variable".