Tensorflow MNIST tutorial - Test Accuracy very low - python

I am getting started with TensorFlow and have been following this standard MNIST tutorial.
However, in contrast to the expected 92% accuracy, the accuracy obtained on both the training set and the test set does not go beyond 67%.
I am familiar with softmax and multinomial regression and have obtained more than 94% accuracy both with a from-scratch Python implementation and with sklearn.linear_model.LogisticRegression.
I tried the same with the CIFAR-10 dataset, and in that case the accuracy was extremely low, just about 10%, which is equivalent to randomly assigning classes. This has made me doubt my installation of TensorFlow, yet I am unsure about this.
Here is my implementation of the TensorFlow MNIST tutorial. I would appreciate it if someone could have a look at it.

You constructed your graph, specified the loss function, and created the optimizer (which is correct). The problem is that you use your optimizer only once:
sess_tf.run(train_step, feed_dict={x: train_images_reshaped[0:1000], y_: train_labels[0:1000]})
So basically you run gradient descent only once. Clearly you can't converge after just one tiny step in the right direction. You need to do something along these lines:
for _ in xrange(many_steps):
    X, Y = get_a_new_batch_from(mnist_data)
    sess_tf.run(train_step, feed_dict={x: X, y_: Y})
If you can't figure out how to modify my pseudo-code, consult the tutorial; as far as I remember, it covers this nicely.
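For concreteness, here is a minimal sketch of such a loop, assuming x, y_, train_step, sess_tf, train_images_reshaped, and train_labels are defined as in your code; it draws random mini-batches with NumPy instead of the hypothetical get_a_new_batch_from helper above, and the step count and batch size are assumptions on my side.

import numpy as np

num_steps = 1000
batch_size = 100

for _ in range(num_steps):
    # Draw a fresh random mini-batch each step instead of reusing the first 1000 examples.
    idx = np.random.choice(len(train_images_reshaped), batch_size, replace=False)
    sess_tf.run(train_step, feed_dict={x: train_images_reshaped[idx],
                                       y_: train_labels[idx]})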

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
The initialization of W may cause your network not to learn anything beyond random guessing, because the gradient will be zero and backpropagation won't actually work at all.
You'd be better off initializing W with tf.Variable(tf.truncated_normal([784, 10], mean=0.0, stddev=0.01)); see https://www.tensorflow.org/api_docs/python/tf/truncated_normal for more.
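For reference, a small sketch of that initialization next to the bias (keeping the bias at zero is fine):

# Small random weights instead of all zeros; the bias can safely stay at zero.
W = tf.Variable(tf.truncated_normal([784, 10], mean=0.0, stddev=0.01))
b = tf.Variable(tf.zeros([10]))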

Not sure if this is still relevant in June 2018, but the MNIST beginner tutorial no longer matches the example code on Github. If you download and run the example code, it does indeed give you the suggested 92% accuracy.
I noticed two things going wrong when following the tutorial:
1) Accidentally calling softmax twice
The tutorial first tells you to define y as follows:
y = tf.nn.softmax(tf.matmul(x, W) + b)
But it later suggests defining cross-entropy using tf.nn.softmax_cross_entropy_with_logits, which makes it easy to accidentally do the following:
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)
This would send your logits (tf.matmul(x, W) + b) through softmax twice, which resulted in me getting stuck at a 67% accuracy.
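A sketch of the fix, keeping the tutorial's variable names: leave y as raw logits and let the loss apply softmax exactly once.

# y holds raw logits; softmax_cross_entropy_with_logits applies softmax internally.
y = tf.matmul(x, W) + b
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))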
However I noticed that even fixing this still only brought me up to a very unstable 80-90% accuracy, which leads me to the next issue:
2) tf.nn.softmax_cross_entropy_with_logits() is deprecated
They haven't updated the tutorial yet, but the tf.nn.softmax_cross_entropy_with_logits page indicates that this function has been deprecated.
In the example code on Github they've replaced it with tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y).
However you can't just swap the function out - the example code also changes the dimensionality on many of the other lines.
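As a rough sketch (not the exact GitHub code): the sparse variant expects integer class indices rather than one-hot vectors, so y_ and anything derived from it change shape.

# Labels are now integer class indices of shape [batch_size], not one-hot vectors.
y_ = tf.placeholder(tf.int64, [None])
y = tf.matmul(x, W) + b  # raw logits
cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=y)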
My suggestion to anyone doing this for the first time would be to download the current working example code from Github and try to match it up to the tutorial concepts without taking the instructions literally. Hopefully they will get around to updating it!

Does it make sense to backpropagate a loss calculated from an earlier layer through the entire network?

Suppose you have a neural network with two layers, A and B. A gets the network input. A and B are consecutive (A's output is fed into B as input). Both A and B output predictions (prediction1 and prediction2). [Picture of the described architecture]
You calculate a loss (loss1) directly after the first layer (A) with a target (target1). You also calculate a loss after the second layer (loss2) with its own target (target2).
Does it make sense to use the sum of loss1 and loss2 as the error function and back propagate this loss through the entire network? If so, why is it "allowed" to back propagate loss1 through B even though it has nothing to do with it?
This question is related to https://datascience.stackexchange.com/questions/37022/intuition-importance-of-intermediate-supervision-in-deep-learning, but it does not answer my question sufficiently.
In my case, A and B are unrelated modules. In the aforementioned question, A and B would be identical. The targets would be the same, too.
(Additional information)
The reason I'm asking is that I'm trying to understand LCNN (https://github.com/zhou13/lcnn) from this paper.
LCNN is made up of an Hourglass backbone, which feeds into a MultiTask Learner (which creates loss1), which in turn feeds into a LineVectorizer module (loss2). Both loss1 and loss2 are then summed up here and backpropagated through the entire network here.
Even though I've attended several deep learning lectures, I didn't know this was "allowed" or made sense to do. I would have expected two loss.backward() calls, one for each loss. Or is the PyTorch computational graph doing something magical here? LCNN converges and outperforms other neural networks that try to solve the same task.
Yes, it is "allowed", and it also makes sense.
From the question, I believe you have understood most of it, so I'm not going to go into detail about why this multi-loss architecture can be useful. I think the main thing that has confused you is: why does loss1 back-propagate through B? The answer is: it doesn't. The fact is that loss1 is calculated using this formula:
loss1 = SOME_FUNCTION(label, y_hat)
and y_hat (prediction1) depends only on the layers before it. Hence, the gradient of this loss only flows through the layers before this point (A) and not the ones after it (B). To understand this better, you could revisit the mathematics of backpropagation. loss2, on the other hand, back-propagates through the entire network (including part A). When you use a cumulative loss (loss = loss1 + loss2), a framework like PyTorch will automatically propagate the gradient of every prediction back to the first layer.
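A minimal PyTorch sketch of this behavior (the modules, shapes, and loss are made up for illustration; this is not the LCNN code):

import torch
import torch.nn as nn

# Stand-ins for modules A and B; A feeds into B.
A = nn.Linear(10, 10)
B = nn.Linear(10, 10)
criterion = nn.MSELoss()

x = torch.randn(4, 10)
target1, target2 = torch.randn(4, 10), torch.randn(4, 10)

prediction1 = A(x)
prediction2 = B(prediction1)

loss1 = criterion(prediction1, target1)  # depends only on A's parameters
loss2 = criterion(prediction2, target2)  # depends on A's and B's parameters

(loss1 + loss2).backward()  # autograd routes loss1's gradient only into A

Summing the losses and calling backward() once is equivalent to calling loss1.backward() and loss2.backward() separately and letting the gradients accumulate; either way, no gradient from loss1 ever reaches B's parameters.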

Keras model.predict function not giving similar results as model.evaluate

I have trained an image classification model using Keras. After training, the model has 95% accuracy on the training data, and using model.evaluate on untouched validation data I get ~92.8% accuracy.
But when I instead use the model.predict function to get the prediction probabilities and take the predicted class with the maximum probability, I get ~80% accuracy.
The complete code is available as a colab notebook on the following link - https://colab.research.google.com/drive/1RQ2KnT2sVsdCAWfpsDj_kcMZiqiwJrpc?usp=sharing
You should be able to run everything and see the difference in accuracy. The problem lies in the code blocks as shown below
To make the accuracies from predict_generator and evaluate_generator match, you have to set the following three parameters in those functions:
shuffle = False
pickle_safe = True
workers = 1
Your program might be running on different threads and these settings make it run on the main thread.
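For example, a rough sketch assuming the images come from an ImageDataGenerator directory iterator (note that pickle_safe was renamed use_multiprocessing in newer Keras versions; model, the directory path, and the image size here are placeholders on my side):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "validation/",            # hypothetical path
    target_size=(224, 224),   # hypothetical image size
    shuffle=False)            # keep order fixed so predictions line up with labels

# Single worker, no multiprocessing, no shuffling: evaluate and predict now agree.
probs = model.predict_generator(val_gen, workers=1, use_multiprocessing=False)
accuracy = np.mean(np.argmax(probs, axis=1) == val_gen.classes)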
The solution I could find so far, after posting the issue here and on the official Keras GitHub (without any answer for weeks), is to use tf.keras instead of Keras. Most of the implementation stayed the same. The "shuffle" option is definitely what messes up the accuracy; the lower accuracy with shuffle=False is probably a bug in the Keras implementation. The tf.keras implementation gives the same result in the evaluate_generator function, and the predict and evaluate outputs match with respect to accuracy. I hope that if other people encounter this error, they don't waste as much time as I did on this issue.

Loss and learning rate scaling strategies for Tensorflow distributed training when using TF Estimator

For those who don't want to read the whole story:
TL;DR: When using TF Estimator, do we have to scale the learning rate by the factor by which we increase the batch size (I know this is the right thing to do; I am just not sure whether TF handles it internally)? Similarly, do we have to scale the per-example loss by the global batch size (batch_size_per_replica * number of replicas)?
The documentation on TensorFlow distributed training is confusing. I need clarification on the points below.
It is now understood that if you increase the batch size by a factor of k, then you need to increase the learning rate by k (see this and this paper). However, TensorFlow's official page on distributed training makes no clarifying comment about this. They do mention here that the learning rate needs to be adjusted. Do they handle the learning rate scaling themselves? To make matters more complicated, the behavior is different in Keras and tf.Estimator (see the next point). Any suggestions on whether or not I should increase the LR by a factor of k when I am using tf.Estimator?
It is widely accepted that the per-example loss should be scaled by global_batch_size = batch_size_per_replica * number of replicas. TensorFlow mentions this here, but when illustrating how to achieve it with a tf.Estimator, they either forget it or the scaling by global_batch_size is not required. See here: in the code snippet, the loss is defined as follows.
loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
and BATCH_SIZE, to the best of my understanding, is defined above as the per-replica batch size.
To complicate things further, the scaling is handled automatically if you are using Keras (for reasons I will never understand, it would have been better to keep everything consistent).
The learning rate is not automatically scaled by the global step. As you said, they even suggest that you might need to adjust the learning rate, but only in some cases, so that is not the default. I suggest that you do increase the learning rate manually.
If we take a look at a simple tf.Estimator, tf.estimator.DNNClassifier (link), the default loss_reduction is losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE. If we go to that Reduction (found here), we see that it is a policy for how to combine losses from individual samples. On a single machine we would just use tf.reduce_mean, but you can't use that in a distributed setting (as mentioned in the next link). The Reduction leads us here, which 1) shows an implementation of how you would do the scaling by the global batch size yourself and 2) explains why. Since they tell you to implement this yourself, this implies it is not handled by tf.Estimator. Also note that on the Reduction page you can find some explanations of the differences between Keras and Estimators regarding these parameters.
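To illustrate the scaling itself (a sketch under assumed names like labels, logits, batch_size_per_replica, and num_replicas, not the Estimator internals): with per-example losses, you divide by the global batch size rather than taking a per-replica mean, which is also what the TF 2.x helper tf.nn.compute_average_loss does.

# per_example_loss has shape [batch_size_per_replica] inside each replica's step.
per_example_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# A per-replica mean (tf.reduce_mean) combined with summing gradients across replicas
# would effectively multiply the loss by the number of replicas, so divide by the
# global batch size instead.
global_batch_size = batch_size_per_replica * num_replicas
loss = tf.reduce_sum(per_example_loss) / global_batch_size

# Equivalent TF 2.x helper:
# loss = tf.nn.compute_average_loss(per_example_loss, global_batch_size=global_batch_size)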

How to use tf.train.ExponentialMovingAverage to evaluate a model which is trained without exponential decay in TensorFlow?

First of all, maybe I have not yet completely understood exponential decay and what tf.train.ExponentialMovingAverage does. Maybe :) As far as I understand it so far, a shadow copy is made of each variable, with which different calculations can be carried out in order to obtain either a better training or a better evaluation result. If someone knows a clear explanation, I would be very happy about it (in English or German).
Now to my problem: I'm training a CNN using tf.train.ExponentialMovingAverage (if more code is needed I'll add it later; please comment).
...
variable_averages = tf.train.ExponentialMovingAverage(get_movining_average_decay(), global_step)
variables_averages_op = variable_averages.apply(tf.trainable_variables())
...
Afterwards I evaluate the model using tf.train.ExponentialMovingAverage, which works fine.
...
variable_averages = tf.train.ExponentialMovingAverage(get_movining_average_decay())
# This may be the problem,
# if no shadow variables have been created during the training,
# no variables can be restored.
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
...
But if I train the CNN without tf.train.ExponentialMovingAverage, I cannot evaluate the model using tf.train.ExponentialMovingAverage. Most likely this line would have to be changed, but I don't know to what:
variables_to_restore = variable_averages.variables_to_restore()
How can I adapt this code? Or is this not possible in TensorFlow? Or have I still not understood exponential moving averages well enough, and it is simply not possible in general?

What is the best way to pass data through?

I'm working with TensorFlow, but I'm pretty new to Python and machine learning. If I have an image tensor from my input pipeline, what would be the best way to train on it? At the most basic level, how would I handle passing data through? I have a structure I would like to use (I know I can get certain data from certain things like tensors), but I'm just not sure how to do so.
I'm very new to this so all help would be greatly appreciated.
def model(image_tensor):
    tf.summary.image('input', image_tensor)
    return predictions

def loss(predictions, labels):
    return some_loss

def train(some_loss):
    return train_op
TensorFlow may be a bit complicated for someone new to machine learning and Python. My advice is to go through the excellent notebook tutorials on the TensorFlow site and start to understand the abstraction.
However, before that, I would use Python with NumPy (and sometimes SciPy) to implement basic machine learning methods like stochastic gradient descent, just to ensure that you understand how the algorithms work. Then implement a simple logistic regression.
So why do I ask you to do all that? Because once you get a good handle on how to work with machine learning algorithms and how tedious it can be to derive the gradients, you will understand why TensorFlow's abstraction is useful.
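To make that concrete, here is a tiny NumPy sketch (my own, not from the tutorials) of binary logistic regression trained with plain stochastic gradient descent:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, epochs=100):
    # X: (n_samples, n_features), y: (n_samples,) with 0/1 labels.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            p = sigmoid(X[i] @ w + b)   # predicted probability for one sample
            grad = p - y[i]             # gradient of the log loss w.r.t. the logit
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

Once you have derived and coded an update like this by hand, letting TensorFlow compute the gradients for you feels far less like magic.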
I'm going to provide you with some simple examples dealing with MNIST.
from sklearn.datasets import load_digits
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

# Load only the digits 0 and 1 as a binary classification problem.
X, y = load_digits(n_class=2, return_X_y=True)
print("y [shape: {}]: {}".format(y.shape, y[:10]))
print("X [shape: {}]".format(X.shape))
What I've essentially done above is load two digit classes (0 and 1) from scikit-learn's digits dataset (a small MNIST-like dataset) and display the label vector y and the feature matrix X.
If you want to see how the images look, you can use plt.imshow(X[0].reshape([8, 8])).
The next step is to define our placeholders and variables:
input_x = tf.placeholder(tf.float32, shape=[None, X.shape[1]], name="input_x")
input_y = tf.placeholder(tf.float32, shape=[None], name="labels")
weights = tf.Variable(initial_value=tf.zeros(shape=[X.shape[1], 1]), name="weights")
b = tf.Variable(initial_value=0.0, name="bias")
What we have done here is define two placeholders in TensorFlow and specify what shape of input they should expect. I also gave each placeholder a name for debugging purposes.
prediction_y = tf.squeeze(tf.nn.sigmoid(tf.matmul(input_x, weights) + b))
loss = tf.losses.log_loss(input_y, prediction_y)
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
There you go, that's logistic regression in TensorFlow. What the last block does is apply the sigmoid activation to the linear transform of our inputs, define the loss function, and then define an optimizer that minimizes that loss.
The final step is to run it.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
s = tf.InteractiveSession()  # create the session the snippet runs everything in
s.run(tf.global_variables_initializer())
for i in range(10):
    s.run(optimizer, {input_x: X_train, input_y: y_train})
    loss_i = s.run(loss, {input_x: X_train, input_y: y_train})
    print("loss at iteration {}: {}".format(i, loss_i))
print("test AUC: {}".format(roc_auc_score(y_test, s.run(prediction_y, {input_x: X_test}))))
That's essentially how you run your data through TensorFlow. This code may have typos; I don't have Python on this machine, so I'm writing from memory. However, the basic idea is there. Hope this helps.
Edit: Also, since you asked about the best way to train on image data: my answer would be that there isn't a single "best". Building a CNN is a typical approach that you may want to experiment with, assuming you have a large number of labeled images. Before CNNs, people also used support vector machines relatively successfully for classifying images.
