TensorFlow MNIST DataSet

TensorFlow MNIST DataSet - python

I started to learn TensorFlow by reading a book which started by classifying MNIST digits.
Link to the code
MINIBATCH_SIZE = 50
STEPS = 5000
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i in range(STEPS):
batch = mnist.train.next_batch(MINIBATCH_SIZE)
if i % 100 == 0:
train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1],
keep_prob: 1.0})
print("step {}, training accuracy {}".format(i, train_accuracy))
sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
X = mnist.test.images.reshape(10, 1000, 784)
Y = mnist.test.labels.reshape(10, 1000, 10)
test_accuracy = np.mean(
[sess.run(accuracy, feed_dict={x: X[i], y_: Y[i], keep_prob: 1.0}) for i in range(10)])
print("test accuracy: {}".format(test_accuracy))
This is the block of code which executes the session. My question is - the for loop iterates STEPS times and batch is the mini batch of size 50.
Shouldn't we iterate STEPS times over the whole training set? This code only trains 50 images in an epoch.
What am I missing here? How does the next_batch() method work

Question
How many iterations should be do over a training set?
Answer
The answer usually is "as many as it takes". Okay, I admit that doesn't really help at first so let's get some jargon out of the way. There is a term Epoch that means one whole pass over the data. That's kinda that minimum IMHO. If you don't go over the whole data set at least once then what's the point? The MNIST data set has about 50,000 training images (60,000 if you don't split our validation). So your tensorflow graph in order to accomplish 1 Epoch would have to process 50,000 images. If your batch size is 50, then that's 1,000 batches. In your above code your batch size is 50 and you do 5,000 batches, in effect you are doing 5 Epochs worth of processing, or 5 passes over the whole data set.
Question
How does the next_batch() method work?
Answer
next_batch returns the specified number of images and labels from the training set. It wraps around so that when you are at the end of your data set it starts back at the beginning. It makes it easy to just get more data instead of having to code the looping and wrapping and slicing of the data yourself.

Batch size
Tensorflow is using gradient descent: at each step in your for loop, you evaluate the error between your predictions and the actual digits to find a gradient for adjusting the weights in your neural net.
You could indeed go through your entire test set each time, but you'd be processing the whole set just to nudge the weights, then going through it all again just to nudge the weights again, and so on. This would work, but be slow for large data sets.
At the other extreme, you could just pick one example in the loop. This is called stochastic gradient descent. Because you only process one example at each step it's very fast, but it's not guaranteed to converge and progress will be rather "jerky".
The code here is doing batch gradient descent which is a halfway house between these two approaches. By processing 50 examples each time the weights are adjusted, you get faster training than full gradient descent and more stability that stochastic gradient descent.
next_batch
The next_batch method just gets the next N records from the test set. By default, as here, the records are shuffled. You can call it as many times as you like; once the records are exhausted it will start again from another shuffled set. You can see the code here if you're interested.
Experimentation
There are 60,000 training images in the MNIST dataset. You could run this code three times, setting MINIBATCH_SIZE to 1, 50 and 60000 respectively to see for yourself how it performs in each case.

Related

Clarification about Gradient Accumulation

I'm trying to get a better understanding of how Gradient Accumulation works and why it is useful. To this end, I wanted to ask what is the difference (if any) between these two possible PyTorch-like implementations of a custom training loop with gradient accumulation:
gradient_accumulation_steps = 5
for batch_idx, batch in enumerate(dataset):
x_batch, y_true_batch = batch
y_pred_batch = model(x_batch)
loss = loss_fn(y_true_batch, y_pred_batch)
loss.backward()
if (batch_idx + 1) % gradient_accumulation_steps == 0: # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
optimizer.step()
optimizer.zero_grad()
y_true_batches, y_pred_batches = [], []
gradient_accumulation_steps = 5
for batch_idx, batch in enumerate(dataset):
x_batch, y_true_batch = batch
y_pred_batch = model(x_batch)
y_true_batches.append(y_true_batch)
y_pred_batches.append(y_pred_batch)
if (batch_idx + 1) % gradient_accumulation_steps == 0: # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
y_true = stack_vertically(y_true_batches)
y_pred = stack_vertically(y_pred_batches)
loss = loss_fn(y_true, y_pred)
loss.backward()
optimizer.step()
optimizer.zero_grad()
y_true_batches.clear()
y_pred_batches.clear()
Also, kind of as an unrelated question: Since the purpose of gradient accumulation is to mimic a larger batch size in cases where you have memory constraints, does it mean that I should also increase the learning rate proportionally?

1. The difference between the two programs:
Conceptually, your two implementations are the same: you forward gradient_accumulation_steps batches for each weight update.
As you already observed, the second method requires more memory resources than the first one.
There is, however, a slight difference: usually, loss functions implementation use mean to reduce the loss over the batch. When you use gradient accumulation (first implementation) you reduce using mean over each mini-batch, but using sum over the accumulated gradient_accumulation_steps mini-batches. To make sure the accumulated gradient implementation is identical to large batches implementation you need to be very careful in the way the loss function is reduced. In many cases you will need to divide the accumulated loss by gradient_accumulation_steps. See this answer for a detailed imlpementation.
2. Batch size and learning rate:
Learning rate and batch size are indeed related. When increasing the batch size one usually reduces the learning rate.
See, e.g.:
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le, Don't Decay the Learning Rate, Increase the Batch Size (ICLR 2018).

How to inject data into a graph when using an input pipeline?

I am using an initializable iterator in my code. The iterator returns batches of size 100 from a csv dataset that has 20.000 entries. During training, however, I came across a problem. Consider this piece of code:
def get_dataset_iterator(batch_size):
# parametrized with batch_size
dataset = ...
return dataset.make_initializable_iterator()
## build a model and train it (x is the input of my model)
iterator = get_dataset_iterator(100)
x = iterator.get_next()
y = model(x)
## L1 norm as loss, this works because the model is an autoencoder
loss = tf.abs(x - y)
## training operator
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
for epoch in range(100):
sess.run(iterator.initializer)
# iterate through the whole dataset once during the epoch and
# do 200 mini batch updates
for _ in range(number_of_samples // batch_size):
sess.run(train_op)
print(f'Epoch {epoch} training done!')
# TODO: print loss after epoch here
I am interested in the training loss AFTER finishing the epoch. It makes most sense to me that I calculate the average loss over the whole training set (e.g. feeding all 20.000 samples through the network and averaging their loss). I could reuse the dataset iterator here with a batch size of 20.000, but I have declared x as the input.
So the questions are:
1.) Does the loss calculation over all 20.000 examples make sense? I have seen some people do the calculation with just a mini-batch (the last batch of the epoch).
2.) How can I calculate the loss over the whole training set with an input pipeline? I have to inject all of training data somehow, so that I can run sess.run(loss) without calculating it over only 100 samples (because x is declared as input).
EDIT FOR CLARIFICATION:
If I wrote my training loop the following way, there would be some things that bother me:
with tf.Session() as sess:
for epoch in range(100):
sess.run(iterator.initializer)
# iterate through the whole dataset once during the epoch and
# do 200 mini batch updates
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')
print(current_loss)
Firstly, loss would still be evaluated before doing the last weight update. That means whatever comes out is not the latest value. Secondly, I would not be able to access current_loss after exiting the for loop so I would not be able to print it.

1) Loss calculation over the whole training set (before updating weights) does make sense and is called batch gradient descent (despite using the whole training set and not a mini batch).
However, calculating a loss for your whole dataset before updating weights is slow (especially with large datasets) and training will take a long time to converge. As a result, using a mini batch of data to calculate loss and update weights is what is normally done instead. Although using a mini batch will produce a noisy estimate of the loss it is actually good enough estimate to train networks with enough training iterations.
EDIT:
I agree that the loss value you print will not be the latest loss with the latest updated weights. Probably for most cases it really doesn't make much different or change results so people just go with how you have wrote the code above. However, if you really want to obtain the true latest loss value after you have done training (to print out) then you will just have to run the loss op again after you have done a train op e.g.:
for _ in range(number_of_samples // batch_size):
sess.run([train_op])
current_loss = sess.run([loss])
This will get your true latest value. Of course this wont be on the whole dataset and will be just for a minibatch of 100. Again the value is likely a good enough estimate but if you wish to calculate exact loss for whole dataset you will have to run through your entire set e.g. another loop and then average the loss:
...
# Train loop
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([train_op, loss])
print(f'Epoch {epoch} training done!')
# Calculate loss of whole train set after training an epoch.
sess.run(iterator.initializer)
current_loss_list = []
for _ in range(number_of_samples // batch_size):
_, current_loss = sess.run([loss])
current_loss_list.append(current_loss)
train_loss_whole_dataset = np.mean(current_loss_list)
print(train_loss_whole_dataset)
EDIT 2:
As pointed out doing the serial calls to train_op then loss will call the iterator twice and so things might not work out nicely (e.g. run out of data). Therefore my 2nd bit of code will be better to use.

I think the following code will answer your questions:
(A) how can you print the batch loss AFTER performing the train step? (B) how can you calculate the loss over the entire training set, even though the dataset iterator gives only a batch each time?
import tensorflow as tf
import numpy as np
dataset_size = 200
batch_size= 5
dimension = 4
# create some training dataset
dataset = tf.data.Dataset.\
from_tensor_slices(np.random.normal(2.0,size=(dataset_size,dimension)).
astype(np.float32))
dataset = dataset.batch(batch_size) # take batches
iterator = dataset.make_initializable_iterator()
x = tf.cast(iterator.get_next(),tf.float32)
w = tf.Variable(np.random.normal(size=(1,dimension)).astype(np.float32))
loss_func = lambda x,w: tf.reduce_mean(tf.square(x-w)) # notice that the loss function is a mean!
loss = loss_func(x,w) # this is the loss that will be minimized
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# we are going to use control_dependencies so that we know that we have a loss calculation AFTER the train step
with tf.control_dependencies([train_op]):
loss_after_train_op = loss_func(x,w) # this is an identical loss, but will only be calculated AFTER train_op has
# been performed
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
# train one epoch
sess.run(iterator.initializer)
for i in range(dataset_size//batch_size):
# the training step will update the weights based on ONE batch of examples each step
loss1,_,loss2 = sess.run([loss,train_op,loss_after_train_op])
print('train step {:d}. batch loss before step: {:f}. batch loss after step: {:f}'.format(i,loss1,loss2))
# evaluate loss on entire training set. Notice that this calculation assumes the the loss is of the form
# tf.reduce_mean(...)
sess.run(iterator.initializer)
epoch_loss = 0
for i in range(dataset_size // batch_size):
batch_loss = sess.run(loss)
epoch_loss += batch_loss*batch_size
epoch_loss = epoch_loss/dataset_size
print('loss over entire training dataset: {:f}'.format(epoch_loss))
As for your question whether it makes sense to calculate loss over the entire training set - yes, it makes sense, for evaluation purposes. It usually does not make sense to perform training steps which are based on all of the training set since this set is usually very large and you want to update your weights more often, without needing to go over the entire training set each time.

Why is TensorFlow's `tf.data` package slowing down my code?

I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch. This is the opposite of what it's supposed to do, I thought. I wrote a simple linear regression program to test it out.
Tl;Dr: With 100,000 training data, tf.data slows time per epoch down by about a factor of ten, if you're using full batch training. Worse if you use smaller batches. The opposite is true with 500 training data.
My question: What is going on? Is my implementation flawed? Other sources I've read have tf.data improving speeds by about 30%.
import tensorflow as tf
import numpy as np
import timeit
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)
n_epochs = 10
input_dimensions_list = [10]
def function_to_approximate(x):
return np.dot(x, random_covector).astype(np.float32) + np.float32(.01) * np.random.randn(1,1).astype(np.float32)
def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X = tf.placeholder(tf.float32, shape=(None, input_dimension), name='X')
Y = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
prediction = tf.matmul(X,weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for _ in range(n_epochs):
sess.run(loss_op, feed_dict={X: training_inputs, Y:training_labels})
def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
tf.reset_default_graph()
weights = tf.get_variable("weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))
X,Y = data_set.make_one_shot_iterator().get_next()
prediction = tf.matmul(X, weights)
loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
loss_op = tf.train.AdamOptimizer(.01).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
while True:
try:
sess.run(loss_op)
except tf.errors.OutOfRangeError:
break
for input_dimension in input_dimensions_list:
for data_size in [500, 100000]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
print("Not using tf.data, with data size "
"{}, input dimension {} and training with "
"a full batch, it took an average of "
"{} seconds to run {} epochs.\n".
format(
data_size,
input_dimension,
timeit.timeit(
lambda: regress_without_tfData(
n_epochs, input_dimension,
training_inputs, training_labels
),
number=3
),
n_epochs))
for input_dimension in input_dimensions_list:
for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
training_labels = function_to_approximate(training_inputs)
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
print("Using tf.data, with data size "
"{}, and input dimension {}, and training with "
"batch size {}, it took an average of {} seconds "
"to run {} epochs.\n".
format(
data_size,
input_dimension,
batch_size,
timeit.timeit(
lambda: regress_with_tfData(
n_epochs, input_dimension,
training_inputs, training_labels,
batch_size
),
number=3
)/3,
n_epochs
))
This outputs for me:
Not using tf.data, with data size 500, input dimension 10 and training
with a full batch, it took an average of 0.20243382899980134 seconds
to run 10 epochs.
Not using tf.data, with data size 100000, input dimension 10 and
training with a full batch, it took an average of 0.2431719040000644
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 50, it took an average of 0.09512088866661846
seconds to run 10 epochs.
Using tf.data, with data size 500, and input dimension 10, and
training with batch size 500, it took an average of
0.07286913600000844 seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 50, it took an average of 4.421892363666605
seconds to run 10 epochs.
Using tf.data, with data size 100000, and input dimension 10, and
training with batch size 100000, it took an average of
2.2555197536667038 seconds to run 10 epochs.
Edit: Fixed an important issue that Fred Guth pointed out. It didn't much affect the results, though.

I wanted to test the dataset API which seems to be really convenient for processing data. I did a lot of time testing about this API in CPU, GPU and multi-GPU way for small and large NN with different type of data.
First thing, It seems to me that your code is ok. But I need to point that your NN is just one simple layer.
Now, the dataset API is not suitable for your type of NN but for NN with a lot more complexity. Why ? For several reasons that I explain below (founded in my quest of understanding the dataset API).
Firstly, in one hand the dataset API processes data each batch whereas in the other hand data are preprocessed. Therefore, if it fits your RAM, you can save time by preprocessing the data. Here your data are just to "simple". If you want to test what i am saying, try to find a really really big dataset to process. Nevertheless, the dataset API can be tuned with prefetching data. You can take a look to this tutorial that explain really well why it is good to process data with prefetch.
Secondly, in my quest of dataset API for Multi-GPU training, I discovered that as far as i know the old pre-processing way is faster than dataset API for small Neural Network. You can verify that by creating a simple stackable RNN which take a sequence in input. You can try different size of stack (i have tested 1, 2, 10 and 20). You will see that, using the dataset API, on 1-GPU or on 4-GPUs, the time did not differ for small RNN stacks (1, 2 and 5).
To summarize, the dataset API is suitable for Neural Network that have data that can't be pre-process. Depending on your task, it may be more convenient to pre-process data, for example if you want to tweak your NN in order to improve it. I agree that the dataset API is really cool for batch, padding and also convenient for shuffling large amount of data but it's also not suitable for multi-GPU training.

First:
You are recreating the dataset unnecessarily.
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
Create the dataset prior to the loop and change the regress_with_tfData input signature to use dataset instead of training_inputs and training_labels.
Second:
The problem here is that minibatches of size 50 or even 500 are too small to compensate the cost of td.data building latency. You should increase the minibatch size. Interestingly you did so with a minibatch of size 100000, but then maybe it is too big ( I am not certain of this, I think it would need more tests).
There are a couple of things you could try:
1) Increase the minibatch size to something like 10000 and see if you get an improvement
2) Change your pipeline to use an iterator, example:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size)
iterator = data_set.make_one_shot_iterator()
....
next_element = iterator.get_next()

That is because you are comparing apples with bananas.
On one hand, when using placeholders, you are providing a monolithic tensor as is. On the other hand, when using Dataset, you are slicing the tensor into individual samples. This is very different.
The equivalent of providing a monolothic placeholder tensor with the Dataset pipeline is by using tf.data.Dataset.from_tensors. When I use from_tensors in your example, I get similar (actually smaller) computation times than with placeholders.
If you want to compare a more sophisticated pipeline using from_tensor_slices, you should use a fair comparison with placeholders. For example, shuffle your data. Add some preprocessing on your slices. I have no doubt you will observe the performance gain that makes people switch to this pipeline.

One possible thing you are missing is a prefetch. Add a prefetch of 1 at the end of your data pipeline like so:
data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.repeat(n_epochs)
data_set = data_set.batch(batch_size).prefetch(1)
Adding a prefetch of 1 at the end of your dataset pipeline means you try and fetch 1 batch of data while training is happening. This way you wont be waiting around while the batch is prepared, it should be ready to go as soon as each train iteration is done.

The accepted answer doesn't help longer valid, as the TF behavior has changed. Per documentation:
from_tensors produces a dataset containing only a single element. To
slice the input tensor into multiple elements, use from_tensor_slices
instead.
This means you cannot batch it
X = np.arange(10)
data = tf.data.Dataset.from_tensors( X )
data = data.batch(2)
for t in data.as_numpy_iterator():
print(t)
# only one row, whereas expected 5 !!!
The documentation recommends from_tensor_slices. But this has quite some overhead when compared to numpy slicing. Slow slicing is an open issue https://github.com/tensorflow/tensorflow/issues/39750
Essentially, slicing in TF is slow and impacts input-bound or light models such as small networks (regression, word2vec).

role of feed dictionary command in the training loop

I am trying to understand training loop. While calculating training and test accuracy we replace x and y_ by training and test sets but while printing the result for cross entropy why we feed x and y_ by batch_xs and batch_ys respectively? I know that we have to assign a value to the placeholders but while calculating test accuracy batch_xs and batch_ys remain same and both are using train set values.
x = tf.placeholder(tf.float32, [None, nPixels])
y_ = tf.placeholder(tf.float32, [None, nLabels])
cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(y, y_))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
with tf.Session() as sess:
tf.initialize_all_variables().run()
for j in range(nSteps):
i = nstep % nSteps
batch_xs = np.reshape(x_train,(nSteps,bSize,nPixels))
batch_ys = np.reshape(y_train,(nSteps,bSize,nLabels))
sess.run(train_step, feed_dict={x: batch_xs[i], y_: batch_ys[i]})
if j % 100 ==0:
train_acc = sess.run(accuracy, feed_dict={x: x_train,y_: y_train})
test_acc = sess.run(accuracy, feed_dict={x: x_test, y_: y_test}))
loss = sess.run(cross_entropy, feed_dict={x: batch_xs[i], y_: batch_ys[i]})

You can calculate your training loss at the same time you do a training step (so that you do not have to make a separate call, passing your batch again later) with something like:
for j in range(nSteps):
# ...
_, train_loss = sess.run([train_step, cross_entropy],
feed_dict={x: batch_xs[i], y_: batch_ys[i]})
As for why you do not need to resize your training/testing sets when calculating your accuracy, let's look at the dimensions of your data throughout your code. You start out with:
x = tf.placeholder(tf.float32, [None, nPixels])
y_ = tf.placeholder(tf.float32, [None, nLabels])
That means that the dimensions of x are [num_samples, nPixels] and the dimensions of y_ are [num_samples, nLabels]. I assume x_train and y_train are just some subsample of this, so [num_train, nPixels] and [num_train, nLabels] respectively. Next you reshape these for your batches:
batch_xs = np.reshape(x_train,(nSteps,bSize,nPixels))
batch_ys = np.reshape(y_train,(nSteps,bSize,nLabels))
Now batch_xs has dimensions [num_steps, batch_size, nPixels] and batch_ys has dimensions [num_steps, batch_size, nLabels]. Note that the last dimensions, the number of features or output dimensions, hasn't changed.
Now for each iteration of your training loop, you take one element from the first dimension of these lists with batch_xs[i] and batch_ys[i]. The dimensions of these guys are now [batch_size, nPixels], and [batch_size, nLables], respectively. Again, the last dimension hasn't changed.
Finally you call your accuracy or cross_entropy ops on a batch or on the whole training set and TF doesn't care which it is! This is because the number of features/labels you are giving it is the same either way, and all that's changing is the number of data elements you are passing. In one case you are passing batch_size elements, and in the other you are passing num_train or num_test elements, but the last dimensions that specify what each sample looks like is the same! TF is magic and figures out how to handle the difference.
Old answer
It looks like you are just trying to get training loss out of your loop, thus you are passing your training batch to your cross_entropy function. A lot of times you might evaluate this at every batch and store the results so that you can plot your training loss over time later, but sometimes this calculation takes long enough that you want to only calculate it every so often. This seems to be the case for your code.
In general, it is a good idea to look at both training and test (or validation) loss while your training loop runs. This is because you are better able to see the effects of overfitting, where training loss continues to decrease and testing loss levels off or begins to increase. Something like:
if j % 100 == 0:
train_loss = sess.run(cross_entropy, feed_dict={x: batch_xs[i], y_: batch_ys[i]})
test_loss = sess.run(cross_entropy, feed_dict={x: x_test, y_: y_test})
# print or store your training and test loss
One of the major reasons for monitoring your testing loss is to stop your training early in the event of overfitting. You can do this by storing your best testing loss, and then comparing each new testing loss to your stored best. Each time you get a new testing loss that does not beat your best, you increment a counter, and when the counter hits a certain number (say 10 or 20), you kill the training loop and call it good. This way you catch when your training has stalled out on the validation/testing set, which is a better indicator of when it will no longer get better at generalizing to new data.
If you encounter a new validation loss that is better than your best, you store it and reset your counter.
Finally, it is often best to hold out two sets of data, one called the validation set, and one called the testing set. When you split it up like this, you typically use the validation set while training to look for overfitting and perform early stopping, and then you run one final test for accuracy/loss against the testing set that the network has never seen before to give you an idea of its overall generalization performance. That looks something like:
for epoch:
for batch in batches:
x = batch.x
y = batch.y
sess.run(train_step, feed_dict={x: x, y_: y})
if j % 100 == 0:
train_loss = sess.run(cross_entropy, feed_dict={x: x, y_: y})
test_loss = sess.run(cross_entropy, feed_dict={x: x_val, y_: y_val}) # note validation data
# print or store your training and test loss
# check for early stopping
test_loss = sess.run(cross_entropy, feed_dict={x: x_test, y_: y_test})
# Or test accuracy or whatever you want to evaluate.
# Point is the network never saw this data until now.

Very low GPU usage during training in Tensorflow

I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is a part of the assignment for the Udacity Deep-Learning course. To be more precise, the task is to classify letters rendered from various fonts (the dataset is called notMNIST).
The code I ended up with looks fairly simple, but no matter what I always get very low GPU usage during training. I measure load with GPU-Z and it shows just 25-30%.
Here is my current code:
graph = tf.Graph()
with graph.as_default():
tf.set_random_seed(52)
# dataset definition
dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
dataset = dataset.shuffle(buffer_size=20000)
dataset = dataset.batch(128)
iterator = dataset.make_initializable_iterator()
sample = iterator.get_next()
x = sample['x']
y = sample['y']
# actual computation graph
keep_prob = tf.placeholder(tf.float32)
is_training = tf.placeholder(tf.bool, name='is_training')
fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
logits = dense(fc3, NUM_CLASSES, 'logits')
with tf.name_scope('accuracy'):
accuracy = tf.reduce_mean(
tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
)
accuracy_percent = 100 * accuracy
with tf.name_scope('loss'):
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
# ensures that we execute the update_ops before performing the train_op
# needed for batch normalization (apparently)
train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)
with tf.Session(graph=graph) as sess:
tf.global_variables_initializer().run()
step = 0
epoch = 0
while True:
sess.run(iterator.initializer, feed_dict={})
while True:
step += 1
try:
sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
except tf.errors.OutOfRangeError:
logger.info('End of epoch #%d', epoch)
break
# end of epoch
train_l, train_ac = sess.run(
[loss, accuracy_percent],
feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
)
test_l, test_ac = sess.run(
[loss, accuracy_percent],
feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
)
logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)
epoch += 1
Here's what I tried so far:
I changed the input pipeline from simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. load data in a separate thread. So there should not be any bottleneck associated with the input.
I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
However, these traces didn't really show anything interesting. >90% of the train step is matmul operations.
Changed batch size. When I change it from 128 to 512 the load increases from ~30% to ~38%, when I increase it further to 2048, the load goes to ~45%. I have 6Gb GPU memory and dataset is single channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
Generally, should I worry about the low load, is it really a sign that I am training inefficiently?
Here's the GPU-Z screenshots with 128 images in the batch. You can see low load with occasional spikes to 100% when I measure accuracy on the entire dataset after each epoch.

MNIST size networks are tiny and it's hard to achieve high GPU (or CPU) efficiency for them, I think 30% is not unusual for your application. You will get higher computational efficiency with larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples total to get to target accuracy. So it's a trade-off. For tiny character models like yours, the statistical efficiency drops off very quickly after a 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.

On my nVidia GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is ~68%.
If I switch to a simple, non-convolutional network, then the GPU load is ~20%.
You can replicate these results by building successively more advanced models in the tutorial Building Autoencoders in Keras by Francis Chollet.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.