I am using an initializable iterator in my code. The iterator returns batches of size 100 from a CSV dataset that has 20,000 entries. During training, however, I ran into a problem. Consider this piece of code:
def get_dataset_iterator(batch_size):
    # parametrized with batch_size
    dataset = ...
    return dataset.make_initializable_iterator()

## build a model and train it (x is the input of my model)
iterator = get_dataset_iterator(100)
x = iterator.get_next()
y = model(x)

## L1 norm as loss, this works because the model is an autoencoder
loss = tf.abs(x - y)

## training operator
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini-batch updates
        for _ in range(number_of_samples // batch_size):
            sess.run(train_op)
        print(f'Epoch {epoch} training done!')
        # TODO: print loss after epoch here
I am interested in the training loss AFTER finishing the epoch. It makes the most sense to me to calculate the average loss over the whole training set (e.g. feeding all 20,000 samples through the network and averaging their losses). I could reuse the dataset iterator here with a batch size of 20,000, but I have already declared x as the input.
So the questions are:
1.) Does the loss calculation over all 20,000 examples make sense? I have seen some people do the calculation with just a mini-batch (the last batch of the epoch).
2.) How can I calculate the loss over the whole training set with an input pipeline? I have to inject all of the training data somehow, so that I can run sess.run(loss) without calculating it over only 100 samples (because x is declared as the input).
EDIT FOR CLARIFICATION:
If I wrote my training loop the following way, a couple of things would bother me:
with tf.Session() as sess:
    for epoch in range(100):
        sess.run(iterator.initializer)
        # iterate through the whole dataset once during the epoch and
        # do 200 mini-batch updates
        for _ in range(number_of_samples // batch_size):
            _, current_loss = sess.run([train_op, loss])
        print(f'Epoch {epoch} training done!')
        print(current_loss)
Firstly, loss would still be evaluated before the last weight update of the epoch, so whatever comes out is not the latest value. Secondly, current_loss would only hold the loss of the final mini-batch, not a value for the whole training set.
1) Loss calculation over the whole training set (before updating weights) does make sense and is called batch gradient descent (despite using the whole training set rather than a mini-batch).
However, calculating the loss over your whole dataset before each weight update is slow (especially with large datasets), and training will take a long time to converge. As a result, using a mini-batch of data to calculate the loss and update the weights is what is normally done instead. Although a mini-batch produces a noisy estimate of the loss, it is a good enough estimate to train networks with, given enough training iterations.
EDIT:
I agree that the loss value you print will not be the latest loss with the latest updated weights. In most cases it probably doesn't make much difference or change the results, so people just go with the way you have written the code above. However, if you really want the true latest loss value after the final train op (to print it out), then you will just have to run the loss op again after the train op, e.g.:
for _ in range(number_of_samples // batch_size):
    sess.run(train_op)
    current_loss = sess.run(loss)
This will get you the true latest value. Of course, this won't be over the whole dataset, just over a mini-batch of 100. Again, the value is likely a good enough estimate, but if you wish to calculate the exact loss over the whole dataset, you will have to run through your entire set in another loop and then average the losses:
...
# Train loop
for _ in range(number_of_samples // batch_size):
    _, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')

# Calculate the loss over the whole train set after training an epoch.
sess.run(iterator.initializer)
current_loss_list = []
for _ in range(number_of_samples // batch_size):
    current_loss = sess.run(loss)
    current_loss_list.append(current_loss)
train_loss_whole_dataset = np.mean(current_loss_list)
print(train_loss_whole_dataset)
EDIT 2:
As pointed out, making serial calls to train_op and then loss advances the iterator twice per step, so things might not work out nicely (e.g. you may run out of data). Therefore the second snippet above, which fetches both in a single sess.run, is the better one to use.
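As a side note, if you want the whole-dataset average to stay exact even when the last batch is smaller than the rest, one option is a running mean over per-example losses; a minimal sketch, assuming the unreduced autoencoder loss from the question:

# Sketch: tf.metrics.mean keeps a running total and count in local variables,
# so the epoch average is exact even with a final partial batch.
per_example_loss = tf.abs(x - y)                # unreduced loss from the question
mean_loss, update_mean = tf.metrics.mean(per_example_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # the metric's counters are local variables
    sess.run(iterator.initializer)
    while True:
        try:
            sess.run(update_mean)               # accumulate one batch into the running mean
        except tf.errors.OutOfRangeError:
            break
    print('epoch loss:', sess.run(mean_loss))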
I think the following code will answer your questions:
(A) how can you print the batch loss AFTER performing the train step? (B) how can you calculate the loss over the entire training set, even though the dataset iterator gives only a batch each time?
import tensorflow as tf
import numpy as np

dataset_size = 200
batch_size = 5
dimension = 4

# create some training dataset
dataset = tf.data.Dataset.from_tensor_slices(
    np.random.normal(2.0, size=(dataset_size, dimension)).astype(np.float32))
dataset = dataset.batch(batch_size)  # take batches

iterator = dataset.make_initializable_iterator()
x = tf.cast(iterator.get_next(), tf.float32)
w = tf.Variable(np.random.normal(size=(1, dimension)).astype(np.float32))

loss_func = lambda x, w: tf.reduce_mean(tf.square(x - w))  # notice that the loss function is a mean!
loss = loss_func(x, w)  # this is the loss that will be minimized
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# we are going to use control_dependencies so that we know that we have a loss
# calculation AFTER the train step
with tf.control_dependencies([train_op]):
    loss_after_train_op = loss_func(x, w)  # an identical loss, but it will only be
                                           # calculated AFTER train_op has been performed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # train one epoch
    sess.run(iterator.initializer)
    for i in range(dataset_size // batch_size):
        # the training step will update the weights based on ONE batch of examples each step
        loss1, _, loss2 = sess.run([loss, train_op, loss_after_train_op])
        print('train step {:d}. batch loss before step: {:f}. batch loss after step: {:f}'
              .format(i, loss1, loss2))

    # evaluate the loss on the entire training set. Notice that this calculation
    # assumes the loss is of the form tf.reduce_mean(...)
    sess.run(iterator.initializer)
    epoch_loss = 0
    for i in range(dataset_size // batch_size):
        batch_loss = sess.run(loss)
        epoch_loss += batch_loss * batch_size
    epoch_loss = epoch_loss / dataset_size
    print('loss over entire training dataset: {:f}'.format(epoch_loss))
As for your question of whether it makes sense to calculate the loss over the entire training set: yes, it does, for evaluation purposes. It usually does not make sense to perform training steps based on the entire training set, since that set is usually very large and you want to update your weights more often, without having to go over the whole set each time.
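If you also want to evaluate with a different batch size than you train with, one common TF 1.x pattern is a reinitializable iterator that can be initialized from either a training or an evaluation dataset. A sketch, assuming dataset is the unbatched dataset built inside the question's helper:

# Sketch: both pipelines share one iterator, so the same x / model graph
# serves training (batches of 100) and whole-set evaluation.
train_dataset = dataset.batch(100)
eval_dataset = dataset.batch(20000)

iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                           train_dataset.output_shapes)
x = iterator.get_next()

train_init_op = iterator.make_initializer(train_dataset)
eval_init_op = iterator.make_initializer(eval_dataset)
# run sess.run(train_init_op) before each training epoch and
# sess.run(eval_init_op) before the evaluation pass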
Related
In PyTorch, I want to evaluate my model on the validation set every eval_step steps during training, and I wrote code like this:
def tune(model, loader_train, loader_dev, optimizer, epochs, eval_step):
    for epoch in range(epochs):
        for step, x in enumerate(loader_train):
            optimizer.zero_grad()
            loss = model(x)
            loss.backward()
            optimizer.step()
            if step % eval_step == 0:
                model.eval()
                test(model, loader_dev)
                model.train()
With eval_step = int(len(loader_train)/2) and eval_step = int(len(loader_train)/8), I get quite different metric results after training through one whole epoch (that is, the second output of the former differs from the eighth output of the latter).
Could anyone explain why?
The length of loader_train is 20000 (it depends on batch size), and here is my testing script:
def test(model, loader_dev):
    preds = []
    labels = []
    for step, x in enumerate(loader_dev):
        preds.append(model(x).view(-1))
        labels.append(x['label'].view(-1))
    metric = cal_metric(preds, labels)
    logger.info(metric)
I think you probably set shuffle=True in your DataLoader. Even if you fix the random seed, a torch DataLoader will produce different results when you consume another dataloader while using the current one. In the scenario you describe, your model may receive the data in a different order, which then leads to different metric results.
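One way to rule the shuffling out is to give the training DataLoader its own seeded generator (supported in recent PyTorch versions), so its shuffle order no longer depends on how the rest of your code draws from the global RNG. A sketch with hypothetical train_dataset and batch-size values:

import torch
from torch.utils.data import DataLoader

# Sketch: a dedicated, seeded generator makes the shuffle order reproducible
# regardless of how often other code (e.g. evaluation) uses the global RNG.
g = torch.Generator()
g.manual_seed(42)
loader_train = DataLoader(train_dataset, batch_size=32, shuffle=True, generator=g)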
I started learning TensorFlow from a book that begins by classifying MNIST digits.
Link to the code
MINIBATCH_SIZE = 50
STEPS = 5000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(STEPS):
        batch = mnist.train.next_batch(MINIBATCH_SIZE)
        if i % 100 == 0:
            train_accuracy = sess.run(accuracy,
                                      feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print("step {}, training accuracy {}".format(i, train_accuracy))
        sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    X = mnist.test.images.reshape(10, 1000, 784)
    Y = mnist.test.labels.reshape(10, 1000, 10)
    test_accuracy = np.mean(
        [sess.run(accuracy, feed_dict={x: X[i], y_: Y[i], keep_prob: 1.0})
         for i in range(10)])
    print("test accuracy: {}".format(test_accuracy))
This is the block of code which executes the session. My question is: the for loop iterates STEPS times, and batch is a mini-batch of size 50.
Shouldn't we iterate STEPS times over the whole training set? As written, each step only trains on 50 images.
What am I missing here? And how does the next_batch() method work?
Question
How many iterations should we do over a training set?
Answer
The answer usually is "as many as it takes". Okay, I admit that doesn't really help at first, so let's get some jargon out of the way. The term epoch means one whole pass over the data. That's kind of the minimum, IMHO: if you don't go over the whole data set at least once, then what's the point? The MNIST data set has about 50,000 training images (60,000 if you don't split off a validation set). So, to accomplish one epoch, your TensorFlow graph would have to process 50,000 images. If your batch size is 50, that's 1,000 batches. In your code above, the batch size is 50 and you do 5,000 batches, so in effect you are doing 5 epochs' worth of processing, or 5 passes over the whole data set.
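The same arithmetic in code (a sketch; the exact number of training images depends on how you load and split MNIST):

TRAIN_IMAGES = 50000                               # ~50,000 with a validation split, 60,000 without
MINIBATCH_SIZE = 50
STEPS = 5000

steps_per_epoch = TRAIN_IMAGES // MINIBATCH_SIZE   # 1,000 batches per epoch
epochs = STEPS / steps_per_epoch                   # 5,000 steps / 1,000 = 5 epochs
print(epochs)                                      # 5.0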
Question
How does the next_batch() method work?
Answer
next_batch returns the specified number of images and labels from the training set. It wraps around, so that when you reach the end of the data set it starts back at the beginning. It makes it easy to just get more data instead of having to code the looping, wrapping, and slicing of the data yourself.
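For intuition, here is a rough sketch of what a next_batch-style helper does (illustrative only, not the actual TensorFlow implementation):

import numpy as np

class Batcher:
    """Serves fixed-size batches, reshuffling and wrapping around when the data runs out."""
    def __init__(self, images, labels):
        self.images, self.labels = images, labels
        self.pos = 0

    def next_batch(self, n):
        if self.pos + n > len(self.images):
            # exhausted: reshuffle and start over from the beginning
            perm = np.random.permutation(len(self.images))
            self.images, self.labels = self.images[perm], self.labels[perm]
            self.pos = 0
        batch = (self.images[self.pos:self.pos + n],
                 self.labels[self.pos:self.pos + n])
        self.pos += n
        return batch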
Batch size
TensorFlow is using gradient descent: at each step in your for loop, you evaluate the error between your predictions and the actual digits to find a gradient for adjusting the weights in your neural net.
You could indeed go through your entire training set each time, but you'd be processing the whole set just to nudge the weights, then going through it all again just to nudge them again, and so on. This would work, but would be slow for large data sets.
At the other extreme, you could pick just one example per step. This is called stochastic gradient descent. Because you only process one example at each step it's very fast, but it's not guaranteed to converge and progress will be rather "jerky".
The code here is doing mini-batch gradient descent, which is a halfway house between these two approaches. By processing 50 examples each time the weights are adjusted, you get faster training than with full-batch gradient descent and more stability than with stochastic gradient descent.
next_batch
The next_batch method just gets the next N records from the training set. By default, as here, the records are shuffled. You can call it as many times as you like; once the records are exhausted, it will start again from another shuffled set. You can see the code here if you're interested.
Experimentation
There are 60,000 training images in the MNIST dataset. You could run this code three times, setting MINIBATCH_SIZE to 1, 50, and 60000 respectively, to see for yourself how it performs in each case.
I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is a part of the assignment for the Udacity Deep-Learning course. To be more precise, the task is to classify letters rendered from various fonts (the dataset is called notMNIST).
The code I ended up with looks fairly simple, but no matter what I do, I always get very low GPU usage during training. I measure the load with GPU-Z and it shows just 25-30%.
Here is my current code:
graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, NUM_CLASSES, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # (needed for batch normalization)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch: evaluate on the full train and test sets
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)
        epoch += 1
Here's what I tried so far:
I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understand, it is supposed to take care of the efficiency of the input, e.g. loading data in a separate thread. So there should not be any bottleneck associated with the input.
I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659. However, these traces didn't really show anything interesting: >90% of the train step is matmul operations.
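For reference, the usual tracing boilerplate looks roughly like this (a sketch of the pattern from that GitHub comment, using the placeholders from my graph):

from tensorflow.python.client import timeline

# Sketch: record a full trace for one training step and dump it as a Chrome
# trace (open chrome://tracing and load timeline.json to inspect it).
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op,
         feed_dict={keep_prob: 0.5, is_training: True},
         options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())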
Changed the batch size. When I change it from 128 to 512 the load increases from ~30% to ~38%, and when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory, and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
Generally, should I worry about the low load? Is it really a sign that I am training inefficiently?
Here's a GPU-Z screenshot with 128 images in the batch. You can see low load with occasional spikes to 100% when I measure accuracy on the entire dataset after each epoch.
MNIST-size networks are tiny and it's hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, statistical efficiency drops off very quickly after about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
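To put a number on the computational-efficiency side of that trade-off, you can time a fixed number of training steps at each candidate batch size; a rough sketch reusing sess, train_op, and the placeholders from the question's code:

import time

# Sketch: measure raw throughput (examples/sec) for the current batch size.
N, BATCH = 100, 128
start = time.time()
for _ in range(N):
    sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
elapsed = time.time() - start
print('%.1f examples/sec' % (N * BATCH / elapsed))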
On my Nvidia GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is ~68%.
If I switch to a simple, non-convolutional network, then the GPU load is ~20%.
You can replicate these results by building successively more advanced models from the tutorial Building Autoencoders in Keras by François Chollet.
I have a TensorFlow model, and one part of this model evaluates the accuracy. The accuracy is just another node in the TensorFlow graph, which takes in logits and labels.
When I want to plot the training accuracy, this is simple: I have something like:
tf.scalar_summary("Training Accuracy", accuracy)
tf.scalar_summary("SomethingElse", foo)
summary_op = tf.merge_all_summaries()
writer = tf.train.SummaryWriter('/me/mydir/', graph=sess.graph)
Then, during my training loop, I have something like:
for n in xrange(1000):
    ...
    summary, ..., ... = sess.run([summary_op, ..., ...], feed_dict)
    writer.add_summary(summary, n)
    ...
Also inside that for loop, every, say, 100 iterations, I want to evaluate the validation accuracy. I have a separate feed_dict for this, and I am able to evaluate the validation accuracy very nicely in Python.
However, here is my problem: I want to make another summary for the validation accuracy, using the same accuracy node. I am not clear on how to do this, though. Since I have the accuracy node, it makes sense that I should be able to re-use it, but I am unsure how to do so exactly, such that I can also get the validation accuracy written out as a separate scalar_summary.
How might this be possible?
You can reuse the accuracy node, but you need to use two different SummaryWriters: one for the training runs and one for the test data. You also have to assign the scalar summary for accuracy to a variable.
accuracy_summary = tf.scalar_summary("Training Accuracy", accuracy)
tf.scalar_summary("SomethingElse", foo)
summary_op = tf.merge_all_summaries()
summaries_dir = '/me/mydir/'
train_writer = tf.train.SummaryWriter(summaries_dir + '/train', sess.graph)
test_writer = tf.train.SummaryWriter(summaries_dir + '/test')
Then, in your training loop, you do the normal training and record your summaries with the train_writer. In addition, every 100th iteration you run the graph on the test set and record only the accuracy summary with the test_writer.
# Record train set summaries, and train
summary, _ = sess.run([summary_op, train_step], feed_dict=...)
train_writer.add_summary(summary, n)

if n % 100 == 0:  # Record summaries and test-set accuracy
    summary, acc = sess.run([accuracy_summary, accuracy], feed_dict=...)
    test_writer.add_summary(summary, n)
    print('Accuracy at step %s: %s' % (n, acc))
You can then point TensorBoard to the parent directory (summaries_dir) and it will load both data sets.
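For example, with the summaries_dir above:

tensorboard --logdir=/me/mydir/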
This can also be found in the TensorFlow HowTos: https://www.tensorflow.org/versions/r0.11/how_tos/summaries_and_tensorboard/index.html
To run the same operation but get summaries with different feed_dict data, simply attach two summary ops to that op. Say you want to run the accuracy op on both validation and test data and want to get summaries for both:
validation_acc_summary = tf.summary.scalar('validation_accuracy', accuracy)  # intended to run on validation set
test_acc_summary = tf.summary.scalar('test_accuracy', accuracy)              # intended to run on test set

with tf.Session() as sess:
    # do your thing
    # ...
    # the accuracy op just needs labels y_ and input x to compute logits
    validation_summary_str = sess.run(validation_acc_summary,
                                      feed_dict={x: mnist.validation.images,
                                                 y_: mnist.validation.labels})
    test_summary_str = sess.run(test_acc_summary,
                                feed_dict={x: mnist.test.images,
                                           y_: mnist.test.labels})

    # assuming you have a tf.summary.FileWriter set up
    file_writer.add_summary(validation_summary_str)
    file_writer.add_summary(test_summary_str)
Also remember that you can always pull the raw (scalar) data out of the protobuf summary_str like this and do your own logging.
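A minimal sketch of that parsing, assuming the TF 1.x tf.Summary protobuf and the validation_summary_str from above:

# Sketch: deserialize the summary protobuf and read out its scalar values.
summary_proto = tf.Summary()
summary_proto.ParseFromString(validation_summary_str)
for value in summary_proto.value:
    print(value.tag, value.simple_value)  # the tag and its raw float value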
I would like to train the weights of a model based on the sum of the loss values of several batches. However, it seems that once you run the graph for each individual batch, the object that is returned is just a regular numpy array. So when you then try to use an optimizer like GradientDescentOptimizer, it no longer has any information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example TensorFlow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])

loss_op = some_loss_function(weights, feature_values, labels)

with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        # Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        # Adds it to the total loss
        total_loss += loss

# Want to train weights to minimize total_loss; however, this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)

with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around this problem, where I want to use information across many batches but still need the graph intact, so that total_loss remains a function of the weights?
The thing you optimize in any of the trainers must be a part of the graph; here, what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches, e.g.
inputs = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3-D tensor, which can be trained on.
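A minimal sketch of that idea, with hypothetical shapes and a made-up squared-error loss, just to show that total_loss stays a graph tensor the optimizer can differentiate:

import tensorflow as tf

number_of_batches, batch_size, input_size = 10, 32, 8
inputs = tf.placeholder(tf.float32, (number_of_batches, batch_size, input_size))
targets = tf.placeholder(tf.float32, (number_of_batches, batch_size, 1))

weights = tf.Variable(tf.ones([input_size, 1]))

# flatten the batch-of-batches so a single graph run sees every example
preds = tf.matmul(tf.reshape(inputs, [-1, input_size]), weights)
total_loss = tf.reduce_sum(tf.square(preds - tf.reshape(targets, [-1, 1])))

# total_loss is a tensor in the graph, so the optimizer can compute gradients
# with respect to weights
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)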