I am using TensorFlow and I have developed a deep multilayer feedforward model. To be confident about the model's performance, I decided to evaluate it with 10-fold cross-validation. In each fold I create a new instance of the neural network and call its train and predict functions.
In each fold I run the following code:
for each fold:
nn = ffNN(hidden_nodes, epochs, learning_rate, saveFrequency, save_path, decay, decay_step, decay_factor, stop_loss, keep_probability, regularization_factor, minimum_cost, activation_function, batch_size, shuffle, stopping_iteration)
nn.initialize(x_size)
nn.train(X,y)
nn.predict(X_test)
In the ffNN file I have the initialization, train, and predict functions as follows:
nn.train:
sess = tf.InteractiveSession()
init = tf.global_variables_initializer()
sess.run(init)
saver = tf.train.Saver()
for each epoch:
    for each batch:
        _, loss = sess.run([self.optimizer, self.loss], feed_dict={self.X: X1, self.y: y})
    if epoch % save_frequency == 0:
        saver.save(sess, save_path)
sess.close()
The problem is with saver.save: in each fold, saving takes longer and longer. Although I create all of the variables from scratch, I don't know what makes saving depend on the fold index and grow slower each time.
Thanks in advance.
Edit:
The code for building the model in nn.initialize is as follows:
self.X = tf.placeholder("float", shape=[None, x_size], name='XValue')
self.y = tf.placeholder("float", shape=[None, y_size], name='yValue')
with tf.variable_scope("initialization", reuse=tf.AUTO_REUSE):
w_in, b_in = init_weights((x_size, self.hidden_nodes))
h_out = self.forwardprop(self.X, w_in, b_in, self.keep_prob,self.activation_function)
l2_norm = tf.add(tf.nn.l2_loss(w_in), tf.nn.l2_loss(b_in))
w_out, b_out = init_weights((self.hidden_nodes, y_size))
l2_norm = tf.add(tf.nn.l2_loss(w_out), l2_norm)
l2_norm = tf.add(tf.nn.l2_loss(b_out), l2_norm)
self.yhat = tf.add(tf.matmul(h_out, w_out), b_out)
self.mse = tf.losses.mean_squared_error(labels=self.y, predictions=self.yhat)
self.loss = tf.add(self.mse,self.regularization_factor * l2_norm)
self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(self.loss)
Based on what you described in the question, the problem is not in saver.save itself but in the computational graph getting bigger and bigger, which is why saving takes more and more time. Make sure to structure the code in the following way (a runnable sketch follows the outline below):
for each fold:
# Clear the previous computational graph
tf.reset_default_graph()
# Then build the graph
nn = ffNN()
# Create the saver
saver = tf.train.Saver()
# Create a session
with tf.Session() as sess:
# Initialize the variables in the graph
sess.run(tf.global_variables_initializer())
# Train the model
for each epoch:
for each batch:
nn.train_on_batch()
if epoch % save_frequency == 0:
saver.save(sess,save_path)
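As a concrete illustration, here is a minimal, self-contained sketch of that structure; the tiny two-layer network, layer sizes, and hyperparameters below are stand-ins of my own rather than your ffNN code. If the graph is really being reset, the node count printed at the end of each fold stays constant:

import tensorflow as tf

n_folds, x_size, y_size, hidden_nodes = 10, 20, 1, 64  # illustrative sizes

for fold in range(n_folds):
    tf.reset_default_graph()  # start every fold with an empty graph

    # Build a small feedforward model from scratch for this fold
    X = tf.placeholder(tf.float32, [None, x_size])
    y = tf.placeholder(tf.float32, [None, y_size])
    hidden = tf.layers.dense(X, hidden_nodes, activation=tf.nn.relu)
    yhat = tf.layers.dense(hidden, y_size)
    loss = tf.losses.mean_squared_error(labels=y, predictions=yhat)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    # Create the saver after the graph is built, once per fold
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... run the epoch/batch loop here and call saver.save(sess, save_path)
        #     every save_frequency epochs, exactly as in your train function ...
        print(fold, len(tf.get_default_graph().as_graph_def().node))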
The TensorFlow graph API separates graph construction from execution. Because of this, I can't understand on which line the neural network is actually executed.
"""
- model_fn: function that performs the forward pass of the model
- init_fn: function that initializes the parameters of the model.
- learning_rate: the learning rate to use for SGD.
"""
tf.reset_default_graph()
is_training = tf.placeholder(tf.bool, name='is_training')
with tf.device(device):
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int32, [None])
params = init_fn() # Initialize the model parameters
scores = model_fn(x, params) # Forward pass of the model
loss = training_step(scores, y, params, learning_rate) # SGD
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for t, (x_np, y_np) in enumerate(train_dset):
feed_dict = {x: x_np, y: y_np}
loss_np = sess.run(loss, feed_dict=feed_dict)
As described in the TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/Session#run):
This method runs one "step" of TensorFlow computation, by running the
necessary graph fragment to execute every Operation and evaluate every
Tensor
In your example, sess.run(tf.global_variables_initializer()) runs the initialization operation that creates all the weights, and loss_np = sess.run(loss, feed_dict=feed_dict) executes all the operations needed to compute loss.
I hope this answers your question.
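To make the construction/execution split concrete, here is a tiny standalone sketch (the tensor names are illustrative). Nothing is computed while the graph is being defined; a number only appears once sess.run() is called:

import tensorflow as tf

# Graph construction: these lines only add nodes to the graph,
# no computation happens yet.
a = tf.placeholder(tf.float32, name='a')
b = tf.placeholder(tf.float32, name='b')
c = a * b  # still just a symbolic Tensor

with tf.Session() as sess:
    # Graph execution: run() evaluates only the fragment needed to produce c.
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # prints 12.0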
I am having all kinds of trouble loading a tensorflow model to test on some new data. When I trained the model, I used this:
save_model_file = 'my_saved_model'
saver = tf.train.Saver()
save_path = saver.save(sess, save_model_file)
This seems to result in the following files being created:
my_saved_model.meta
checkpoint
my_saved_model.index
my_saved_model.data-00000-of-00001
I have no idea which of these files I am supposed to pay attention to.
Now the model is trained, and I can't seem to load it or use it without throwing an exception. Here is what I am doing:
def neural_net_data_input(data_shape):
theshape=(None,)+tuple(data_shape)
return tf.placeholder(tf.float32,shape=theshape,name='x')
def neural_net_label_input(n_out):
return tf.placeholder(tf.float32,shape=(None,n_out),name='one_hot_labels')
def neural_net_keep_prob_input():
return tf.placeholder(tf.float32,name='keep_prob')
def do_generate_network(x):
#
# here is where i generate the network layer by layer.
# this code works fine so i am not showing it here
#
pass
#
# Now I want to restore the model
#
tf.reset_default_graph()
input_data_shape=(32,32,1)
final_num_outputs=43
graph1 = tf.Graph()
with graph1.as_default():
x = neural_net_data_input(input_data_shape)
one_hot_labels = neural_net_label_input(final_num_outputs)
keep_prob=neural_net_keep_prob_input()
logits = do_generate_network(x)
# Name the logits Tensor so that it can be loaded from disk after training
logits = tf.identity(logits, name='logits')
#
# accuracy: we use this for validation testing
#
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
################################
# Evaluate
################################
new_data=myutils.load_pickle_file(SOME_DATA_FILE_NAME)
new_features=new_data['features']
new_one_hot_labels=new_data['labels']
print('Evaluating on new data...')
with tf.Session(graph=graph1) as sess:
# Initializing the variables
sess.run(tf.global_variables_initializer())
saver.restore(sess,save_model_file)
new_acc = sess.run(accuracy, feed_dict={x: new_features, one_hot_labels: new_one_hot_labels, keep_prob: 1.})
print('Testing Accuracy For New Images: {}'.format(new_acc))
But when I do this, I get this:
TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph.
So, I tried moving my graph inside the session like this:
################################
# Evaluate
################################
print('Evaluating on web data...')
with tf.Session() as sess:
x = neural_net_data_input(input_data_shape)
one_hot_labels = neural_net_label_input(final_num_outputs)
keep_prob=neural_net_keep_prob_input()
logits = do_generate_network(x)
# Name the logits Tensor so that it can be loaded from disk after training
logits = tf.identity(logits, name='logits')
#
# accuracy: we use this for validation testing
#
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
sess.run(tf.global_variables_initializer())
my_save_dir="/home/carnd/CarND-Traffic-Sign-Classifier-Project"
load_model_meta_file=os.path.join(my_save_dir,"my_saved_model.meta")
load_model_path=os.path.join(my_save_dir,"my_saved_model")
new_saver = tf.train.import_meta_graph(load_model_meta_file)
new_saver.restore(sess, load_model_path)
web_acc = sess.run(accuracy, feed_dict={x: web_features, one_hot_labels: web_one_hot_labels, keep_prob: 1.})
print('Testing Accuracy For Web Images: {}'.format(web_acc))
Now it runs without throwing an error, but the accuracy it prints is 0.02! I am feeding in the very same data on which I was getting 95% accuracy during training. So it appears I am somehow loading my model incorrectly.
What am I doing wrong?
Steps for loading the trained model:
Load the graph:
You can load the graph using tf.train.import_meta_graph(). An example code would be:
model_path = "my_saved_model"
inference_graph = tf.Graph()
with tf.Session(graph= inference_graph) as sess:
# Load the graph with the trained states
loader = tf.train.import_meta_graph(model_path+'.meta')
loader.restore(sess, model_path)
Get the tensors: Get the tensors needed for inference using get_tensor_by_name(). In your model, make sure you give the tensors explicit names so that you can retrieve them during inference.
#Get the tensors by their variable name
_accuracy = inference_graph.get_tensor_by_name('accuracy:0')
_x = inference_graph.get_tensor_by_name('x:0')
_y = inference_graph.get_tensor_by_name('y:0')
Test: This can be done using the loaded tensors: sess.run(_accuracy, feed_dict={_x: ..., _y: ...}). A consolidated sketch of all three steps follows below.
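Putting the three steps together, a minimal end-to-end sketch might look like the following. The tensor names ('x:0', 'one_hot_labels:0', 'keep_prob:0', 'accuracy:0') are taken from the placeholders and ops named in your training code, and model_path is the prefix you passed to saver.save(); adjust both to whatever you actually used:

import tensorflow as tf

model_path = "my_saved_model"  # prefix passed to saver.save()

inference_graph = tf.Graph()
with tf.Session(graph=inference_graph) as sess:
    # Rebuild the graph structure from the .meta file and restore the weights
    loader = tf.train.import_meta_graph(model_path + '.meta')
    loader.restore(sess, model_path)
    # Do NOT run tf.global_variables_initializer() here;
    # it would overwrite the restored weights with random values.

    # Look up the tensors by the names given at training time
    _x = inference_graph.get_tensor_by_name('x:0')
    _labels = inference_graph.get_tensor_by_name('one_hot_labels:0')
    _keep_prob = inference_graph.get_tensor_by_name('keep_prob:0')
    _accuracy = inference_graph.get_tensor_by_name('accuracy:0')

    new_acc = sess.run(_accuracy, feed_dict={_x: new_features,
                                             _labels: new_one_hot_labels,
                                             _keep_prob: 1.0})
    print('Testing Accuracy For New Images: {}'.format(new_acc))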
I'm training a convolutional model in tensorflow. After training the model for about 70 epochs, which took almost 1.5 hrs, I couldn't save the model. It gave me ValueError: GraphDef cannot be larger than 2GB. I found that as the training proceeds the number of nodes in my graph keeps increasing.
At epochs 0, 3, 6, and 9, the number of nodes in the graph is 7214, 7238, 7262, and 7286 respectively. When I use with tf.Session() as sess: instead of creating the session with sess = tf.Session(), the number of nodes is 3982, 4006, 4030, and 4054 at epochs 0, 3, 6, and 9 respectively.
In this answer, it is said that as nodes get added to the graph, it can exceed its maximum size. I need help understanding how the number of nodes keeps going up in my graph.
I train my model using the code below:
def runModel(data):
'''
Defines cost, optimizer functions, and runs the graph
'''
X, y,keep_prob = modelInputs((755, 567, 1),4)
logits = cnnModel(X,keep_prob)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name="cost")
optimizer = tf.train.AdamOptimizer(.0001).minimize(cost)
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1), name="correct_pred")
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
for e in range(12):
batch_x, batch_y = data.next_batch(30)
x = tf.reshape(batch_x, [30, 755, 567, 1]).eval(session=sess)
batch_y = tf.one_hot(batch_y,4).eval(session=sess)
sess.run(optimizer, feed_dict={X: x, y: batch_y,keep_prob:0.5})
if e%3==0:
n = len([n.name for n in tf.get_default_graph().as_graph_def().node])
print("No.of nodes: ",n,"\n")
current_cost = sess.run(cost, feed_dict={X: x, y: batch_y,keep_prob:1.0})
acc = sess.run(accuracy, feed_dict={X: x, y: batch_y,keep_prob:1.0})
print("At epoch {epoch:>3d}, cost is {a:>10.4f}, accuracy is {b:>8.5f}".format(epoch=e, a=current_cost, b=acc))
What causes an increase in the number of nodes?
You are creating new nodes within your training loop. In particular, you are calling tf.reshape and tf.one_hot, each of which creates one (or more) nodes. You can either:
Create those nodes once, outside of the training loop, using placeholders as inputs, and then only evaluate them in the loop (see the sketch after the NumPy example below).
Not use TensorFlow for those operations and instead use NumPy or equivalent operations.
I would recommend the second one, since there does not seem to be any benefit in using TensorFlow for data preparation. You can have something like:
import numpy as np
# ...
x = np.reshape(batch_x, [30, 755, 567, 1])
# ...
# One way of doing one-hot encoding with NumPy
classes_arr = np.arange(4).reshape([1] * batch_y.ndim + [-1])
batch_y = (np.expand_dims(batch_y, -1) == classes_arr).astype(batch_y.dtype)
# ...
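For reference, here is a minimal sketch of the first option: build the reshape and one-hot ops once, before the loop, and only evaluate them inside it. The placeholder names and the unspecified shape of raw_x are my own assumptions about what data.next_batch returns:

import tensorflow as tf

# Built once, outside the training loop
raw_x = tf.placeholder(tf.float32, shape=None, name='raw_x')
raw_y = tf.placeholder(tf.int32, shape=[None], name='raw_y')
x_reshaped = tf.reshape(raw_x, [-1, 755, 567, 1])
y_one_hot = tf.one_hot(raw_y, 4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for e in range(12):
        batch_x, batch_y = data.next_batch(30)
        # Only evaluate existing ops here; no new nodes are added to the graph
        x_np, y_np = sess.run([x_reshaped, y_one_hot],
                              feed_dict={raw_x: batch_x, raw_y: batch_y})
        # ... feed x_np and y_np into the training step as before ...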
P.S.: I'd also recommend using tf.Session() in a with context manager to make sure its close() method is called at the end (unless you want to keep using the same session later).
Another option, which solved a similar problem for me, is to use tf.reset_default_graph().
I am trying to use the transfer learning approach. Here is a snippet of the code where my model is training on the training data:
max_accuracy = 0.0
saver = tf.train.Saver()
for epoch in range(epocs):
shuffledRange = np.random.permutation(n_train)
y_one_hot_train = encode_one_hot(len(classes), Y_input)
y_one_hot_validation = encode_one_hot(len(classes), Y_validation)
shuffledX = X_input[shuffledRange,:]
shuffledY = y_one_hot_train[shuffledRange]
for Xi, Yi in iterate_mini_batches(shuffledX, shuffledY, mini_batch_size):
sess.run(train_step,
feed_dict={bottleneck_tensor: Xi,
ground_truth_tensor: Yi})
# Every so often, print out how well the graph is training.
is_last_step = (i + 1 == FLAGS.how_many_training_steps)
if (i % FLAGS.eval_step_interval) == 0 or is_last_step:
train_accuracy, cross_entropy_value = sess.run(
[evaluation_step, cross_entropy],
feed_dict={bottleneck_tensor: Xi,
ground_truth_tensor: Yi})
validation_accuracy = sess.run(
evaluation_step,
feed_dict={bottleneck_tensor: X_validation,
ground_truth_tensor: y_one_hot_validation})
print('%s: Step %d: Train accuracy = %.1f%%, Cross entropy = %f, Validation accuracy = %.1f%%' %
(datetime.now(), i, train_accuracy * 100, cross_entropy_value, validation_accuracy * 100))
result_tensor = sess.graph.get_tensor_by_name(ensure_name_has_port(FLAGS.final_tensor_name))
probs = sess.run(result_tensor,feed_dict={'pool_3/_reshape:0': Xi[0].reshape(1,2048)})
if validation_accuracy > max_accuracy :
saver.save(sess, 'models/superheroes_model')
max_accuracy = validation_accuracy
print(probs)
i+=1
Here is the code where I am loading the model:
def load_model () :
sess=tf.Session()
#First let's load meta graph and restore weights
saver = tf.train.import_meta_graph('models/superheroes_model.meta')
saver.restore(sess,tf.train.latest_checkpoint('models/'))
sess.run(tf.global_variables_initializer())
result_tensor = sess.graph.get_tensor_by_name(ensure_name_has_port(FLAGS.final_tensor_name))
X_feature = features[0].reshape(1,2048)
probs = sess.run(result_tensor,
feed_dict={'pool_3/_reshape:0': X_feature})
print(probs)
return sess
So now, for the same data point, I am getting totally different results during training and testing. It's not even close. During testing, my probabilities are all near 25%, as I have 4 classes, but during training the highest class probability is 90%.
Is there any issue while saving or restoring the model?
Be careful -- you are calling
sess.run(tf.global_variables_initializer())
after calling
saver.restore(sess,tf.train.latest_checkpoint('models/'))
I've done something similar before, and I think that resets all your trained weights/biases/etc. in the restored model.
If you must, call the initializer prior to restoring the model, and if you need to initialize something specific from the restored model, do it individually.
Delete sess.run(tf.global_variables_initializer()) in your load_model function. If you run it after restoring, all your trained parameters are replaced with their initial values, which produces roughly 1/4 probability for each class.
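For clarity, here is a minimal corrected sketch of load_model with the initializer call removed, assuming the same checkpoint paths and helpers (ensure_name_has_port, FLAGS.final_tensor_name, features) as in your code:

import tensorflow as tf

def load_model():
    sess = tf.Session()
    # Load the meta graph and restore the trained weights
    saver = tf.train.import_meta_graph('models/superheroes_model.meta')
    saver.restore(sess, tf.train.latest_checkpoint('models/'))
    # No tf.global_variables_initializer() here: it would overwrite the
    # restored weights with fresh random values.
    result_tensor = sess.graph.get_tensor_by_name(
        ensure_name_has_port(FLAGS.final_tensor_name))
    X_feature = features[0].reshape(1, 2048)
    probs = sess.run(result_tensor,
                     feed_dict={'pool_3/_reshape:0': X_feature})
    print(probs)
    return sess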
I am trying to implement a suggestion from the answers to:
Tensorflow: how to save/restore a model?
I have an object which wraps a tensorflow model in a sklearn style.
import numpy as np
import tensorflow as tf
class tflasso():
saver = tf.train.Saver()
def __init__(self,
learning_rate = 2e-2,
training_epochs = 5000,
display_step = 50,
BATCH_SIZE = 100,
ALPHA = 1e-5,
checkpoint_dir = "./",
):
...
def _create_network(self):
...
def _load_(self, sess, checkpoint_dir = None):
if checkpoint_dir:
self.checkpoint_dir = checkpoint_dir
print("loading a session")
ckpt = tf.train.get_checkpoint_state(self.checkpoint_dir)
if ckpt and ckpt.model_checkpoint_path:
self.saver.restore(sess, ckpt.model_checkpoint_path)
else:
raise Exception("no checkpoint found")
return
def fit(self, train_X, train_Y , load = True):
self.X = train_X
self.xlen = train_X.shape[1]
# n_samples = y.shape[0]
self._create_network()
tot_loss = self._create_loss()
optimizer = tf.train.AdagradOptimizer( self.learning_rate).minimize(tot_loss)
# Initializing the variables
init = tf.initialize_all_variables()
" training per se"
getb = batchgen( self.BATCH_SIZE)
yvar = train_Y.var()
print(yvar)
# Launch the graph
NUM_CORES = 3 # Choose how many cores to use.
sess_config = tf.ConfigProto(inter_op_parallelism_threads=NUM_CORES,
intra_op_parallelism_threads=NUM_CORES)
with tf.Session(config= sess_config) as sess:
sess.run(init)
if load:
self._load_(sess)
# Fit all training data
for epoch in range( self.training_epochs):
for (_x_, _y_) in getb(train_X, train_Y):
_y_ = np.reshape(_y_, [-1, 1])
sess.run(optimizer, feed_dict={ self.vars.xx: _x_, self.vars.yy: _y_})
# Display logs per epoch step
if (1+epoch) % self.display_step == 0:
cost = sess.run(tot_loss,
feed_dict={ self.vars.xx: train_X,
self.vars.yy: np.reshape(train_Y, [-1, 1])})
rsq = 1 - cost / yvar
logstr = "Epoch: {:4d}\tcost = {:.4f}\tR^2 = {:.4f}".format((epoch+1), cost, rsq)
print(logstr )
self.saver.save(sess, self.checkpoint_dir + 'model.ckpt',
global_step= 1+ epoch)
print("Optimization Finished!")
return self
When I run:
tfl = tflasso()
tfl.fit( train_X, train_Y , load = False)
I get output:
Epoch: 50 cost = 38.4705 R^2 = -1.2036
b1: 0.118122
Epoch: 100 cost = 26.4506 R^2 = -0.5151
b1: 0.133597
Epoch: 150 cost = 22.4330 R^2 = -0.2850
b1: 0.142261
Epoch: 200 cost = 20.0361 R^2 = -0.1477
b1: 0.147998
However, when I try to recover the parameters (even without killing the object):
tfl.fit( train_X, train_Y , load = True)
I get strange results. First of all, the loaded value does not correspond to the saved one.
loading a session
loaded b1: 0.1 <------- Loaded another value than saved
Epoch: 50 cost = 30.8483 R^2 = -0.7670
b1: 0.137484
What is the right way to load, and probably first inspect the saved variables?
TL;DR: You should try to rework this class so that self._create_network() is called (i) only once, and (ii) before the tf.train.Saver() is constructed.
There are two subtle issues here, which are due to the code structure and the default behavior of the tf.train.Saver constructor. When you construct a saver with no arguments (as in your code), it collects the current set of variables in your program and adds ops to the graph for saving and restoring them. In your code, the saver is built as a class attribute of tflasso, before any variables exist (because _create_network() has not yet been called). As a result, the saver captures no variables and the checkpoint will effectively be empty.
The second issue is that—by default—the format of a saved checkpoint is a map from the name property of a variable to its current value. If you create two variables with the same name, they will be automatically "uniquified" by TensorFlow:
v = tf.Variable(..., name="weights")
assert v.name == "weights"
w = tf.Variable(..., name="weights")
assert w.name == "weights_1"  # The "_1" is added by TensorFlow.
The consequence of this is that, when you call self._create_network() in the second call to tfl.fit(), the variables will all have different names from the names stored in the checkpoint, or rather the names that would have been stored if the saver had been constructed after the network. (You can avoid this behavior by passing a name-to-Variable dictionary to the saver constructor, but this is usually quite awkward.)
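For completeness, a minimal sketch of the name-to-Variable dictionary mentioned above (the variables and names here are purely illustrative):

import tensorflow as tf

v = tf.Variable(tf.zeros([10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")

# The dictionary keys are the names used inside the checkpoint file,
# independent of any uniquified names the variables got in this graph.
saver = tf.train.Saver({"weights": v, "bias": b})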
There are two main workarounds:
In each call to tflasso.fit(), create the whole model afresh by defining a new tf.Graph, and then, in that graph, build the network and create a tf.train.Saver.
RECOMMENDED: Create the network and then the tf.train.Saver in the tflasso constructor, and reuse this graph on each call to tflasso.fit(). Note that you might need to do some more work to reorganize things (in particular, I'm not sure what you do with self.X and self.xlen), but it should be possible to achieve this with placeholders and feeding (a rough sketch follows below).
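A rough sketch of that recommended structure, under the assumption that the input width xlen is known when the object is constructed; the tiny linear model and attribute names below are illustrative stand-ins, not your exact _create_network() and _create_loss():

import numpy as np
import tensorflow as tf

class tflasso(object):
    def __init__(self, xlen, learning_rate=2e-2, checkpoint_dir="./"):
        self.checkpoint_dir = checkpoint_dir
        self.graph = tf.Graph()
        with self.graph.as_default():
            # Build the network exactly once, before the saver is created
            self.xx = tf.placeholder(tf.float32, [None, xlen])
            self.yy = tf.placeholder(tf.float32, [None, 1])
            w = tf.Variable(tf.zeros([xlen, 1]), name="weights")
            b = tf.Variable(tf.zeros([1]), name="bias")
            pred = tf.matmul(self.xx, w) + b
            self.loss = tf.reduce_mean(tf.square(pred - self.yy))
            self.optimizer = tf.train.AdagradOptimizer(
                learning_rate).minimize(self.loss)
            self.init = tf.global_variables_initializer()
            # The saver now captures exactly the variables defined above,
            # with stable names across every call to fit()
            self.saver = tf.train.Saver()

    def fit(self, train_X, train_Y, load=True):
        with tf.Session(graph=self.graph) as sess:
            sess.run(self.init)
            if load:
                ckpt = tf.train.get_checkpoint_state(self.checkpoint_dir)
                if ckpt and ckpt.model_checkpoint_path:
                    self.saver.restore(sess, ckpt.model_checkpoint_path)
            # Single training step shown for brevity; loop over epochs/batches here
            sess.run(self.optimizer, feed_dict={
                self.xx: train_X,
                self.yy: np.reshape(train_Y, [-1, 1])})
            self.saver.save(sess, self.checkpoint_dir + 'model.ckpt')
        return self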