Using tf.data.Dataset makes saved model bigger

I recently ran into an issue where my saved model files are much larger than before.
I am using TensorFlow 1.4.
Previously, I used
tf.train.string_input_producer() and tf.train.batch()
to load images from a text file, and during training
tf.train.start_queue_runners() and tf.train.Coordinator()
were used to feed data to the network. In that case, every call to
saver.save(sess, checkpoint_path, global_step=iters)
only gave me a small file, i.e. a file named model.ckpt-1000.data-00000-of-00001 of about 1.6 MB.
Now I use
tf.data.Dataset.from_tensor_slices()
to supply images to an input placeholder, and the saved model has grown to 290 MB, and I don't know why. I suspect the TensorFlow saver stores the dataset inside the model as well. If so, how can I remove it so that the model is smaller and only the weights of the network are saved?
This is not network dependent, because I tried it with two networks and both behaved the same way.
I have googled but unfortunately found nothing related to this issue. (Or maybe this is not an issue at all and I am just doing something wrong?)
Thank you very much for any ideas and help!
Edit
The way I initialise the dataset is:
1. First, generate the numpy.array dataset:
self.train_hr, self.train_lr = cifar10.load_dataset(sess)
The initial dataset is a numpy.array, for example of shape [8000, 32, 32, 3]. I pass sess into this function because inside it I call tf.image.resize_images() and use sess.run() to generate the numpy arrays. The returned self.train_hr and self.train_lr are numpy arrays of shape [8000, 64, 64, 3].
2. Then I create the dataset:
self.img_hr = tf.placeholder(tf.float32)
self.img_lr = tf.placeholder(tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((self.img_hr, self.img_lr))
dataset = dataset.repeat(conf.num_epoch).shuffle(buffer_size=conf.shuffle_size).batch(conf.batch_size)
self.iterator = dataset.make_initializable_iterator()
self.next_batch = self.iterator.get_next()
3. Then I initialise the network and the dataset, run the training, and save the model:
self.labels = tf.placeholder(tf.float32,
                             shape=[conf.batch_size, conf.hr_size, conf.hr_size, conf.img_channel])
self.inputs = tf.placeholder(tf.float32,
                             shape=[conf.batch_size, conf.lr_size, conf.lr_size, conf.img_channel])
self.net = Net(self.labels, self.inputs, mask_type=conf.mask_type,
               is_linear_only=conf.linear_mapping_only, scope='sr_spc')

sess.run(self.iterator.initializer,
         feed_dict={self.img_hr: self.train_hr, self.img_lr: self.train_lr})
while True:
    hr_img, lr_img = sess.run(self.next_batch)
    _, loss, summary_str = sess.run([train_op, self.net.loss, summary_op],
                                    feed_dict={self.labels: hr_img, self.inputs: lr_img})
    ...
    ...
    checkpoint_path = os.path.join(conf.model_dir, 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=iters)
All the sess references above are the same session instance.

I suspect you created a TensorFlow constant (tf.constant) out of your dataset, which would explain why the dataset gets stored with the graph. There is an initializable dataset which lets you feed in the data using feed_dict at runtime. It takes a few extra lines of code to configure, but it is probably what you want to use.
https://www.tensorflow.org/programmers_guide/datasets
Note that constants get created for you automatically in the Python wrapper. The following statements are equivalent:
tf.Variable(42)
tf.Variable(tf.constant(42))
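For illustration, here is a minimal sketch of both patterns (shapes and names are illustrative): building the dataset from an in-graph constant, which may get stored together with the graph, versus the initializable pattern that keeps the data out of the graph and feeds it through a placeholder when the iterator is initialized.
import numpy as np
import tensorflow as tf

data = np.random.rand(8000, 64, 64, 3).astype(np.float32)

# Embeds the whole array in the graph as a constant, so it can be
# serialized along with the graph.
dataset_const = tf.data.Dataset.from_tensor_slices(tf.constant(data))

# Keeps the data out of the graph: only a placeholder is defined, and the
# actual array is fed once when the iterator is initialized.
data_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
dataset_fed = tf.data.Dataset.from_tensor_slices(data_ph).batch(16)
iterator = dataset_fed.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={data_ph: data})
    batch = sess.run(next_batch)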

TensorFlow does indeed save your dataset. To fix this, let's understand why.
How does TensorFlow work, and what does it save?
In short, the TensorFlow API lets you build a computation graph via code and then optimize it. Every op/variable/constant you define in the graph operates on tensors and is part of that graph. This framework is convenient because TensorFlow just builds a graph, and then the framework decides (or you specify) where to compute the graph in order to get maximum speed out of your hardware, for instance by computing on your GPU.
The GPU is a good illustration of your issue. Sending data from HDD/RAM/CPU to the GPU is expensive time-wise. Therefore, TensorFlow also lets you create input producers that manage the data transfer between all peripheral units more or less automatically, by queueing it and managing threads. However, I haven't seen much gain from that approach. Note that the inputs produced by datasets are also tensors, specifically constants/variables that are used as inputs to the network; therefore, they are part of the graph.
When saving a graph, we save several things:
Metadata - which defines the graph and its structure.
Values - of each variable/constant in the graph, in order to load it and reuse the network.
When you use datasets, the values of those non-trainable variables are saved as well, and therefore your checkpoint file is larger.
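One way to confirm this is to list what a checkpoint actually contains; a minimal sketch (the checkpoint path is illustrative, and older 1.x releases expose the same helper under tf.contrib.framework.list_variables):
import tensorflow as tf

# Prints every tensor stored in the checkpoint together with its shape,
# which makes oversized non-trainable entries easy to spot.
for name, shape in tf.train.list_variables('checkpoint/model.ckpt-1000'):
    print(name, shape)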
To better understand datasets, see its implementation in the package files.
TL;DR - How do I fix my problem?
If it does not hurt performance, use a feed dictionary to feed placeholders. Do not use tensors to store your data; that way, those variables will not be saved.
Save only the tensors you would like to load (weights, biases, etc.). You can use the .eval() method to get their values, save them as JSON or similar, and load them later by reconstructing the graph.
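A minimal sketch of that idea (variable and file names are illustrative):
import json
import tensorflow as tf

w = tf.Variable(tf.random_normal([784, 200]), name='w')
b = tf.Variable(tf.zeros([200]), name='b')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training ...
    # Dump only the values you care about; .eval() returns NumPy arrays.
    values = {v.name: v.eval(session=sess).tolist() for v in [w, b]}
    with open('weights.json', 'w') as f:
        json.dump(values, f)

# Later: rebuild the graph, read the JSON back, and assign the stored
# values to the recreated variables (e.g. via tf.assign or an initializer).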
Good luck!

I solved this issue (not perfectly, as I still don't know exactly where the problem happens). Instead, I made a workaround to avoid saving a large amount of data.
I defined a saver that is fed a specific list of variables. That list only contains the nodes of my graph. Here is a small example of my workaround:
import tensorflow as tf

v1 = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="v1")
v2 = tf.Variable(tf.zeros([200]), name="v2")
saver = tf.train.Saver([v2])
# saver = tf.train.Saver()
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    saver.save(sess, "checkpoint/model_test", global_step=1)
[v2] is the variable list. Alternatively, you can use variables_list = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='net') to collect all the nodes of a given scope.
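A small sketch of that scoped variant (the scope name 'net' is illustrative); note that the same var_list has to be rebuilt when restoring:
import tensorflow as tf

with tf.variable_scope('net'):
    w = tf.Variable(tf.random_normal([784, 200]), name='w')
    b = tf.Variable(tf.zeros([200]), name='b')

# Collect only the network's variables; anything defined outside the
# scope (e.g. dataset-related tensors) is excluded from the checkpoint.
net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='net')
saver = tf.train.Saver(net_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, 'checkpoint/model_test', global_step=1)
    # Restoring later requires the same variables and a Saver built with
    # the same var_list:
    # saver.restore(sess, 'checkpoint/model_test-1')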

Increasing number of classes on TensorFlow checkpoint

For a project I'm working on I have to compare the performance of a small, simulated dataset to the performance of well-known datasets (e.g. ImageNet, CIFAR-10) using two neural networks, ResNet v2 at different depths and Inception v3.
To account for the imbalance between my set and the big popular ones, one of the methods I want to employ is augmenting the big sets using either a class from my simulated set or the same class but instead with real-life images. Then, using a pre-trained model, I want to train on both augmented sets and later compare performance.
The issue
The augmentation works fine: I copy over the .tfrecord files containing only images of the extra class to the original dataset and I modify the labels.txt file to contain the appropriate extra labels. However, the issue occurs when trying to train on this augmented dataset:
I0125 16:40:31.279634 140094914221888 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [258] rhs shape= [257]
which is to be expected, as the graph cannot handle the extra class.
My attempted solution
To alleviate this issue, I tried applying the same method as proposed in this answer, using the Python script below (modified for brevity):
import numpy as np
import tensorflow as tf

checkpoint = {}
saver = tf.compat.v1.train.import_meta_graph(f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}.meta")
with tf.Session() as sess:
    saver.restore(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
    # get the appropriate nodes where the number of classes is one of their dimensions
    if "inception_v3" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    elif "resnet_v2" in TRAIN_PATH:
        target_nodes = [
            '<nodes where one of its dimensions is equal to the number of classes>'
        ]
    # select the nodes previously defined
    nodes = [node for node in tf.global_variables() if node.name in target_nodes]
    for node in nodes:
        # load the values of those nodes from the checkpoint
        checkpoint[node.name] = node.eval()
        # extend dimensions for the extra class
        checkpoint.update({node.name: np.insert(checkpoint[node.name],
                                                checkpoint[node.name].shape[-1], 0.0, axis=-1)})
        # assign the new values to the nodes
        sess.run(tf.assign(node, checkpoint[node.name], validate_shape=False))
    saver.save(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")
In short, this script loads the checkpoint, isolates all nodes where one of the dimensions is equal to the number of classes, and increases that dimension. I have verified that these dimensions are always equal to the number of classes by using the checkpoints from the different datasets the models have been trained on. In fact, those tensors seem to be the only thing that differs between the checkpoints.
Although this method executes correctly and allows me to run the training, it feels very boneheaded and I'm almost positive it is not at all what I'm supposed to do. However, I'm curious as to what exactly is wrong with it (apart from initializing the new values as 0.0), because apart from a wild fluctuation in loss at the start of the training it seems to work as I had hoped.
Other solutions
While looking further into this issue, I also found other users' answers on similar questions suggesting that it is in fact impossible to modify a checkpoint in the way I want, and recommending transfer learning or modifying the output layer instead. I am very new to neural network training and I don't know whom to believe, as the linked answer seems to suggest that what I'm trying to do should work.
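For comparison, a rough sketch of the transfer-learning-style alternative those answers describe: rebuild the graph for the new class count, restore every variable except the class-dependent ones, and let the new output layer start from a fresh initialization. The paths and the name filter below are hypothetical and depend on the model.
import tensorflow as tf

TRAIN_PATH = "path/to/train_dir"  # hypothetical, as in the script above
CHECKPOINT_NUM = 12345            # hypothetical

# Assumes the graph has already been built for the new number of classes.
all_vars = tf.global_variables()
# Hypothetical filter: skip variables whose shape depends on the class count.
restore_vars = [v for v in all_vars if "Logits" not in v.name]

saver = tf.train.Saver(var_list=restore_vars)
with tf.Session() as sess:
    # Fresh initialization covers the new, class-dependent output layer.
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, f"{TRAIN_PATH}/model.ckpt-{CHECKPOINT_NUM}")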
My question
So I ask: which approach is correct? Could my initial approach work, or should I try a different method to fix this issue? Starting from scratch would not be ideal, as the progress so far has taken over a month of continuous training.
I am using a modified version of the TensorFlow Model Garden as my environment to train the networks, with the modifications pertaining to creating custom datasets and using them with the supplied training scripts.

Why does tf.feature_column.input_layer return my feature values in rearranged order

I am currently testing some ideas with LinearClassifier using TensorFlow's new API. I have created some features using just tf.feature_column.numeric_column and feed them into a tf.feature_column.input_layer layer as follows:
layer_test = tf.feature_column.input_layer(input_fn('data.csv')[0], feature_columns_raw)
And then I simply check whether my input features are correctly fed via:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    layer_test_val = sess.run(layer_test)
    print(layer_test_val)
But I found it very strange and counterintuitive that the column order in layer_test_val was changed completely out of my control. It's no big deal if the LinearClassifier can eventually reach decent prediction accuracy despite the rearranged feature order. However, if what I need is the learned weights of those features, how can I make sure that the output of
weights = lr_classifier.get_variable_value('logits/kernel')
is in the exact order of the features as they were specified?
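For reference, a self-contained version of that check (column names are illustrative). In the 1.x implementation, input_layer appears to concatenate the columns sorted by column name rather than in the order they were passed, which would explain the rearrangement.
import tensorflow as tf

features = {
    'b_feat': tf.constant([[1.0], [2.0]]),
    'a_feat': tf.constant([[3.0], [4.0]]),
}
columns = [tf.feature_column.numeric_column('b_feat'),
           tf.feature_column.numeric_column('a_feat')]

# Despite the list order above, the output columns come back in
# name-sorted order (a_feat, then b_feat).
layer = tf.feature_column.input_layer(features, columns)
with tf.Session() as sess:
    print(sess.run(layer))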

How to grab one tensor from an existing model and use it in another one?

What I want to do is to grab some weights and biases from an existing trained model, and then use them in my customized op (model or graph).
I can restore the model with:
# Create context
with tf.Graph().as_default(), tf.Session() as sess:
    # Create model
    with tf.variable_scope('train'):
        train_model = MyModel(some_args)
And then grab the tensor:
latest_ckpt = tf.train.latest_checkpoint(path)
if latest_ckpt:
    saver.restore(sess, latest_ckpt)
weight = tf.get_default_graph().get_tensor_by_name("example:0")
My question is, if I want to use that weight in another context (model or graph), how can I safely copy its value to the new graph? For example:
with self.test_session(use_gpu=True, graph=ops.Graph()) as sess:
    with vs.variable_scope("test", initializer=initializer):
        # How can I make it possible?
        w = tf.get_variable('name', initializer=weight)
Any help is welcome, thank you so much.
Thanks to #Sorin for the inspiration; I found a simple and clean way to do this:
z = graph.get_tensor_by_name('prefix/NN/W1:0')
with tf.Session(graph=graph) as sess:
    z_value = sess.run(z)

with tf.Graph().as_default() as new_graph, tf.Session(graph=new_graph) as sess:
    w = tf.get_variable('w', initializer=z_value)
The hacky way is to use tf.assign to assign the weight to the variable you want (make sure it only happens once at the beginning, and not on every iteration, otherwise the model won't be able to adjust those weights).
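A minimal sketch of that approach (names and shapes are illustrative): evaluate the tensor in the original session to get its NumPy value, then assign it once in the new graph.
import numpy as np
import tensorflow as tf

# In practice this value comes from the original session, e.g.
#   weight_value = sess.run(tf.get_default_graph().get_tensor_by_name("example:0"))
weight_value = np.random.rand(784, 200).astype(np.float32)  # stand-in

with tf.Graph().as_default(), tf.Session() as new_sess:
    w = tf.get_variable('w', shape=weight_value.shape, dtype=tf.float32)
    assign_op = tf.assign(w, weight_value)
    new_sess.run(assign_op)  # run once at the beginning, not every iteration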
The slightly less hacky way is to load the graph and session of the trained model and modify the graph to add the operations you need. This makes the graph a bit messier, since you also carry around the entire graph of the original model, but it's a bit cleaner in that you can depend directly on the operations instead of the weights (that is, if the original model used a sigmoid activation, this copies the activation as well). The unused parts of the graph will be automatically pruned by TensorFlow.
The clean way is to use TensorFlow Hub (www.tensorflow.org/hub). It's a library that allows you to define parts of the graph as modules that you can export and import into any graph. This handles all dependencies and configuration, and also gives you nice control over training (i.e. if you want to freeze the weights, or delay the training for some number of iterations, etc.).

Unable to load and use multiple keras models

I'm trying to load three different models in the same process. Only the first one works as expected; the others return what look like random results.
Basically the order is as follows:
define and compile first model
load its previously trained weights
rename layers
the same process for the second model
the same process for the third model
So, something like:
model1 = Model(inputs=Input(shape=input_size_im), outputs=layers_firstmodel)
model1.compile(optimizer='sgd', loss='mse')
model1.load_weights(weights_first, by_name=True)
# rename layers but didn't work

model2 = Model(inputs=Input(shape=input_size_im), outputs=layers_secondmodel)
model2.compile(optimizer='sgd', loss='mse')
model2.load_weights(weights_second, by_name=True)
# rename layers but didn't work

model3 = Model(inputs=Input(shape=input_size_im), outputs=layers_thirdmodel)
model3.compile(optimizer='sgd', loss='mse')
model3.load_weights(weights_third, by_name=True)
# rename layers but didn't work

for im in list_images:
    results_firstmodel = model1.predict(im)
    results_secondmodel = model2.predict(im)
    results_thirdmodel = model3.predict(im)
I'd like to perform inference over a bunch of images. The idea is to loop over the images, run inference with these three models, and return the results.
I have tried renaming all layers to make them unique, with no success. I also created a different graph for each network and ran inference with a separate session per graph. This works, but it is very inefficient (in addition, I have to set the weights every time, because sess.run(tf.global_variables_initializer()) resets them). Every time a session is created, TensorFlow prints "creating tensorflow device (/device:GPU:0)".
I am running TensorFlow 1.4.0-rc0, Keras 2.1.1 and Ubuntu 16.04 with kernel 4.14.
The OP is correct here. There is a serious bug when you try to load multiple weight files in the same script. The above answer doesn't solve this. If you actually inspect the weights when loading weights for multiple models in the same script, you will notice that they differ from the weights you get when loading only one model on its own. This is where the randomness the OP observes comes from.
EDIT: To solve this problem, you have to encapsulate the model.load_weights call within a function, and the randomness you are experiencing should go away. The problem is that something goes wrong when you have multiple load_weights calls at the top level of the same script, as above. If you load the model weights inside a function, the issue should go away.
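A sketch of that suggestion (the architecture and weight-file names below are stand-ins for the models in the question):
from keras.models import Sequential
from keras.layers import Dense

def build_and_load(weights_path):
    # Build the model and load its weights inside a single function scope.
    model = Sequential([Dense(10, input_shape=(4,), name='dense_a'),
                        Dense(1, name='dense_b')])
    model.compile(optimizer='sgd', loss='mse')
    model.load_weights(weights_path)
    return model

# One call per model, each with its own (hypothetical) weight file:
model1 = build_and_load('weights_first.h5')
model2 = build_and_load('weights_second.h5')
model3 = build_and_load('weights_third.h5')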
From the Keras docs we have this explanation for the user of load_weights:
loads the weights of the model from a HDF5 file (created by save_weights). By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.
Therefore, if your architecture is unchanged, you should drop the by_name=True or set it to False (its default value). This could be causing the inconsistencies you are facing, as your weights may not be loaded at all due to your layers having different names.
Another important thing to consider is the nature of your HDF5 file and the way you created it. If it indeed contains only the weights (created with save_weights, as the docs point out), then there should be no problem proceeding as explained before.
Now, if that HDF5 file contains both weights and architecture, then you should be loading it with keras.models.load_model instead (further reading here, if you like). If that is the case, it would also explain the inconsistencies.
As a side suggestion, I prefer to save my models using callbacks, like ModelCheckpoint, or EarlyStopping if you want to determine automatically when to stop training. This not only gives you greater flexibility when training and saving your models (as you can stop at the optimal training epoch or whenever you wish), but also makes loading those models easier, as you can simply use load_model to restore both architecture and weights into your desired variable.
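A sketch of that callback-based workflow (model1, x_train and y_train stand in for the models and data from the question; paths and training settings are illustrative):
from keras.callbacks import ModelCheckpoint
from keras.models import load_model

# Saves the full model (architecture + weights) whenever val_loss improves.
checkpoint_cb = ModelCheckpoint('model1_best.h5', monitor='val_loss',
                                save_best_only=True)
model1.fit(x_train, y_train, validation_split=0.1, epochs=50,
           callbacks=[checkpoint_cb])

# Later, in any script: restore architecture and weights in one call.
model1 = load_model('model1_best.h5')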
Finally, here is one useful SO post where saving (and loading) Keras models is explained.

Loading SavedModel is a lot slower than loading a tf.train.Saver checkpoint

I changed from tf.train.Saver to the SavedModel format, which surprisingly means loading my model from disk is a lot slower (instead of a couple of seconds it takes minutes). Why is this, and what can I do to load the model faster?
I used to do this:
# Save model
saver = tf.train.Saver()
save_path = saver.save(session, model_path)
# Load model
saver = tf.train.import_meta_graph(model_path + '.meta')
saver.restore(session, model_path)
But now I do this:
# Save model
builder = tf.saved_model.builder.SavedModelBuilder(model_path)
builder.add_meta_graph_and_variables(session, [tf.saved_model.tag_constants.TRAINING])
builder.save()
# Load model
tf.saved_model.loader.load(session, [tf.saved_model.tag_constants.TRAINING], model_path)
I am by no means an expert in TensorFlow, but if I had to guess why this is happening, I would say that:
tf.train.Saver() saves a complete meta-graph. Therefore, all the information needed to perform any operations contained in your graph is already there. All TensorFlow needs to do to load the model is insert the meta-graph into the default/current graph, and you're good to go.
SavedModelBuilder(), on the other hand, creates a language-agnostic representation of your operations and variables behind the scenes. This means that the loading method has to extract all that information, then recreate all the operations and variables from your previous graph and insert them into the default/current graph.
Depending on the size of your graph, recreating everything it contained might take a while.
Concerning the second question, as #J H said, if there is no reason for you to use one strategy over the other and time is of the essence, just go with the fastest one.
what can I do to load the model faster?
Switch back to tf.train.Saver, as your question shows no motivation for using SavedModelBuilder and makes it clear that elapsed time matters to you. Alternatively, an MCVE that reproduces the timing issue would allow others to collaborate with you on profiling, diagnosing, and fixing any perceived performance issue.
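If you do want to profile it, a minimal timing harness along those lines might look like this (paths are illustrative):
import time
import tensorflow as tf

saved_model_dir = 'export/my_model'           # illustrative
checkpoint_prefix = 'checkpoints/model.ckpt'  # illustrative

with tf.Session(graph=tf.Graph()) as sess:
    start = time.time()
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.TRAINING],
                               saved_model_dir)
    print('SavedModel load: %.1f s' % (time.time() - start))

with tf.Session(graph=tf.Graph()) as sess:
    start = time.time()
    saver = tf.train.import_meta_graph(checkpoint_prefix + '.meta')
    saver.restore(sess, checkpoint_prefix)
    print('Saver restore: %.1f s' % (time.time() - start))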
