I am learning TensorFlow and was implementing a simple neural network as explained in MNIST for Beginners in the TensorFlow docs (here is the link). The accuracy was about 80-90%, as expected.
The next part of the same article was MNIST for Experts, which uses a ConvNet. Instead of implementing that, I decided to improve the beginner part. I know about neural nets and how they learn, and the fact that deep networks can perform better than shallow ones. I modified the original program from MNIST for Beginners to implement a neural network with 2 hidden layers of 16 neurons each.
It looks something like this:
Image of Network
Code for it
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
x = tf.placeholder(tf.float32, [None, 784], 'images')
y = tf.placeholder(tf.float32, [None, 10], 'labels')
# We are going to make 2 hidden layer neurons with 16 neurons each
# All the weights in network
W0 = tf.Variable(dtype=tf.float32, name='InputLayerWeights', initial_value=tf.zeros([784, 16]))
W1 = tf.Variable(dtype=tf.float32, name='HiddenLayer1Weights', initial_value=tf.zeros([16, 16]))
W2 = tf.Variable(dtype=tf.float32, name='HiddenLayer2Weights', initial_value=tf.zeros([16, 10]))
# All the biases for the network
B0 = tf.Variable(dtype=tf.float32, name='HiddenLayer1Biases', initial_value=tf.zeros([16]))
B1 = tf.Variable(dtype=tf.float32, name='HiddenLayer2Biases', initial_value=tf.zeros([16]))
B2 = tf.Variable(dtype=tf.float32, name='OutputLayerBiases', initial_value=tf.zeros([10]))
def build_graph():
    """This function wires up all the biases and weights of the network
    and returns the last layer connections
    :return: returns the activation in last layer of network/output layer without softmax
    """
    A1 = tf.nn.relu(tf.matmul(x, W0) + B0)
    A2 = tf.nn.relu(tf.matmul(A1, W1) + B1)
    return tf.matmul(A2, W2) + B2
def print_accuracy(sx, sy, tf_session):
    """This function prints the accuracy of a model at the time of invocation
    :return: None
    """
    correct_prediction = tf.equal(tf.argmax(y), tf.argmax(tf.nn.softmax(build_graph())))
    correct_prediction_float = tf.cast(correct_prediction, dtype=tf.float32)
    accuracy = tf.reduce_mean(correct_prediction_float)
    print(accuracy.eval(feed_dict={x: sx, y: sy}, session=tf_session))
y_predicted = build_graph()
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_predicted))
model = tf.train.GradientDescentOptimizer(0.03).minimize(cross_entropy)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_x, batch_y = mnist.train.next_batch(50)
        if _ % 100 == 0:
            print_accuracy(batch_x, batch_y, sess)
        sess.run(model, feed_dict={x: batch_x, y: batch_y})
The output was expected to be better than what could be achieved with just a single layer (assuming W0 has shape [784, 10] and B0 has shape [10]):
def build_graph():
    return tf.matmul(x, W0) + B0
Instead, the output shows that the network was not training at all. The accuracy did not cross 20% in any iteration.
Output
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
My Question
What is wrong with the above program that it does not learn at all? How can I improve it further without using convolutional neural networks?
Your main mistake is network symmetry: you initialized all weights to zero, so every neuron in a layer computes the same output and receives the same gradient, and as a result the weights effectively never get updated. Change them to small random numbers and the network will start learning. It's fine to initialize the biases with zeros.
Another issue is purely technical: the print_accuracy function creates new nodes in the computational graph every time it is called, and since you call it in the loop, the graph keeps growing and will eventually use up all memory. Build the accuracy ops once, outside the loop, as sketched below.
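A minimal sketch of that restructuring, reusing the question's x, y and build_graph, so that the accuracy sub-graph is created exactly once:
# Build the accuracy sub-graph once, outside the training loop.
y_predicted = build_graph()
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_predicted, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Inside the loop, only evaluate the existing tensor; no new nodes are added:
# print(sess.run(accuracy, feed_dict={x: batch_x, y: batch_y}))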
You might also want to play with hyper-parameters and make the network bigger to increase its capacity.
Edit: I also spotted a bug in your accuracy calculation. It should be
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_predicted, 1))
Here's the complete code:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
x = tf.placeholder(tf.float32, [None, 784], 'images')
y = tf.placeholder(tf.float32, [None, 10], 'labels')
W0 = tf.Variable(dtype=tf.float32, name='InputLayerWeights', initial_value=tf.truncated_normal([784, 16]) * 0.001)
W1 = tf.Variable(dtype=tf.float32, name='HiddenLayer1Weights', initial_value=tf.truncated_normal([16, 16]) * 0.001)
W2 = tf.Variable(dtype=tf.float32, name='HiddenLayer2Weights', initial_value=tf.truncated_normal([16, 10]) * 0.001)
B0 = tf.Variable(dtype=tf.float32, name='HiddenLayer1Biases', initial_value=tf.ones([16]))
B1 = tf.Variable(dtype=tf.float32, name='HiddenLayer2Biases', initial_value=tf.ones([16]))
B2 = tf.Variable(dtype=tf.float32, name='OutputLayerBiases', initial_value=tf.ones([10]))
A1 = tf.nn.relu(tf.matmul(x, W0) + B0)
A2 = tf.nn.relu(tf.matmul(A1, W1) + B1)
y_predicted = tf.matmul(A2, W2) + B2
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_predicted, 1))
correct_prediction_float = tf.cast(correct_prediction, dtype=tf.float32)
accuracy = tf.reduce_mean(correct_prediction_float)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_predicted))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)
mnist = input_data.read_data_sets('mnist', one_hot=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch_x, batch_y = mnist.train.next_batch(64)
        _, cost_val, acc_val = sess.run([optimizer, cross_entropy, accuracy], feed_dict={x: batch_x, y: batch_y})
        if i % 100 == 0:
            print('cost=%.3f accuracy=%.3f' % (cost_val, acc_val))
Related
import numpy as np
import tensorflow as tf
import pandas as pd
data = pd.read_csv('mnist_train.csv')
X = data.drop('label', axis=1).values
y = data['label'].values
with tf.Session() as sess:
    Y = tf.one_hot(y, 10).eval()

hidden = [5, 4, 3]

def costa(y, yhat):
    loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=yhat, labels=y)
    loss = tf.reduce_sum(loss)
    return loss

def train(cost):
    train_op = tf.train.GradientDescentOptimizer(0.0001).minimize(cost)
    return train_op

with tf.Graph().as_default():
    X1 = tf.placeholder(tf.float32, [None, 784])
    y1 = tf.placeholder(tf.float32, [None, 10])
    w1 = tf.Variable(tf.random_normal((784, hidden[0])))
    w2 = tf.Variable(tf.random_normal((hidden[0], hidden[1])))
    w3 = tf.Variable(tf.random_normal((hidden[1], hidden[2])))
    wo = tf.Variable(tf.random_normal((hidden[2], 10)))
    b1 = tf.Variable(tf.random_normal((1, hidden[0])))
    b2 = tf.Variable(tf.random_normal((1, hidden[1])))
    b3 = tf.Variable(tf.random_normal((1, hidden[2])))
    bo = tf.Variable(tf.random_normal((1, 10)))
    layer1 = tf.nn.relu(tf.matmul(X1, w1) + b1)
    layer2 = tf.nn.relu(tf.matmul(layer1, w2) + b2)
    layer3 = tf.nn.relu(tf.matmul(layer2, w3) + b3)
    layerout = (tf.matmul(layer3, wo) + bo)
    yhat = layerout
    cost = costa(y1, yhat)
    train_op = train(cost)
    init_op = tf.global_variables_initializer()
    for epoch in range(1000):
        with tf.Session() as sess:
            sess.run(init_op)
            sess.run(train_op, feed_dict={X1: X, y1: Y})
            loss = sess.run(cost, feed_dict={X1: X, y1: Y})
            print("Loss for epoch {}: {}".format(epoch, loss))
The loss stays around the same value; it jumps up and down a lot but does not decrease as it should.
I can't seem to find what is going wrong here; any help would be appreciated.
Is it the activations to the layers or am I getting the cost function wrong?
There are a couple of issues here:
You are running sess.run(init_op) every epoch. This means the model parameters are reset to random numbers every epoch, so the model is unable to learn. Try running this op once, before the for epoch in range(1000) loop.
You are creating a new session every epoch. Change your code so it looks like this:
with tf.Session() as sess:
    sess.run(init_op)
    for epoch in range(1000):
        sess.run(train_op, feed_dict={X1: X, y1: Y})
        loss = sess.run(cost, feed_dict={X1: X, y1: Y})
        print("Loss for epoch {}: {}".format(epoch, loss))
Initialising weights with a standard deviation of (2.0/neurons_in_prev_layer)**0.5 worked like a charm for me!
I also changed the hidden layers to 2 hidden layers of 256 neurons each.
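That standard deviation is the He initialization rule for ReLU layers. A minimal sketch of applying it to the first two layers (the 784/256/256 sizes are taken from this comment and are otherwise an assumption):
n_input, n_hidden1, n_hidden2 = 784, 256, 256

# He initialization: stddev = sqrt(2 / fan_in), which keeps the variance of
# ReLU activations roughly constant as the network gets deeper.
w1 = tf.Variable(tf.random_normal((n_input, n_hidden1), stddev=(2.0 / n_input) ** 0.5))
b1 = tf.Variable(tf.zeros((1, n_hidden1)))
w2 = tf.Variable(tf.random_normal((n_hidden1, n_hidden2), stddev=(2.0 / n_hidden1) ** 0.5))
b2 = tf.Variable(tf.zeros((1, n_hidden2)))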
Okay, one little tweak did the trick: I used RMSPropOptimizer instead, and the loss started decreasing as expected.
I still have to figure out why this works (I'm still learning), but for now this is the solution I have.
The loss decreases very slowly, though.
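For reference, swapping the optimizer only changes one line in the train helper from the question above (the 0.001 learning rate here is an assumption):
def train(cost):
    # RMSProp adapts the step size per parameter, which often helps when
    # plain gradient descent stalls or oscillates.
    train_op = tf.train.RMSPropOptimizer(0.001).minimize(cost)
    return train_op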
I'm building a simple neural network that takes 3 values and gives 2 outputs.
I'm getting an accuracy of 67.5% and an average cost of 0.05
I have a training dataset of 1000 examples and 500 testing examples. I plan on making a larger dataset in the near future.
A little while ago I managed to get an accuracy of about 82% and sometimes a bit higher, but the cost was quite high.
I've been experimenting with adding another layer, which is currently in the model, and that is how I got the loss under 1.0.
I'm not sure what is going wrong, I'm new to Tensorflow and NNs in general.
Here is my code:
import tensorflow as tf
import numpy as np
import sys
sys.path.insert(0, '.../Dataset/Testing/')
sys.path.insert(0, '.../Dataset/Training/')
#other files
from TestDataNormaliser import *
from TrainDataNormaliser import *
learning_rate = 0.01
trainingIteration = 10
batchSize = 100
displayStep = 1
x = tf.placeholder("float", [None, 3])
y = tf.placeholder("float", [None, 2])
#layer 1
w1 = tf.Variable(tf.truncated_normal([3, 4], stddev=0.1))
b1 = tf.Variable(tf.zeros([4]))
y1 = tf.matmul(x, w1) + b1
#layer 2
w2 = tf.Variable(tf.truncated_normal([4, 4], stddev=0.1))
b2 = tf.Variable(tf.zeros([4]))
#y2 = tf.nn.sigmoid(tf.matmul(y1, w2) + b2)
y2 = tf.matmul(y1, w2) + b2
w3 = tf.Variable(tf.truncated_normal([4, 2], stddev=0.1))
b3 = tf.Variable(tf.zeros([2]))
y3 = tf.nn.sigmoid(tf.matmul(y2, w3) + b3) #sigmoid
#output
#wO = tf.Variable(tf.truncated_normal([2, 2], stddev=0.1))
#bO = tf.Variable(tf.zeros([2]))
a = y3 #tf.nn.softmax(tf.matmul(y2, wO) + bO) #y2
a_ = tf.placeholder("float", [None, 2])
#cost function
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(a)))
#cross_entropy = -tf.reduce_sum(y*tf.log(a))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
#training
init = tf.global_variables_initializer() #initialises tensorflow
with tf.Session() as sess:
    sess.run(init) #runs the initialiser
    writer = tf.summary.FileWriter(".../Logs")
    writer.add_graph(sess.graph)
    merged_summary = tf.summary.merge_all()
    for iteration in range(trainingIteration):
        avg_cost = 0
        totalBatch = int(len(trainArrayValues)/batchSize) #1000/100
        #totalBatch = 10
        for i in range(batchSize):
            start = i
            end = i + batchSize #100
            xBatch = trainArrayValues[start:end]
            yBatch = trainArrayLabels[start:end]
            #feeding training data
            sess.run(optimizer, feed_dict={x: xBatch, y: yBatch})
            i += batchSize
            avg_cost += sess.run(cross_entropy, feed_dict={x: xBatch, y: yBatch})/totalBatch
        if iteration % displayStep == 0:
            print("Iteration:", '%04d' % (iteration + 1), "cost=", "{:.9f}".format(avg_cost))
    #
    print("Training complete")
    predictions = tf.equal(tf.argmax(a, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(predictions, "float"))
    print("Accuracy:", accuracy.eval({x: testArrayValues, y: testArrayLabels}))
A few important notes:
You don't have non-linearities between your layers. This means you're training a network which is equivalent to a single-layer network, just with a lot of wasted computation. This is easily solved by adding a simple non-linearity, e.g. tf.nn.relu after each matmul/+ bias line, e.g. y2 = tf.nn.relu(y2) for all bar the last layer.
You are using a numerically unstable cross-entropy implementation. I'd encourage you to use tf.nn.sigmoid_cross_entropy_with_logits and remove your explicit sigmoid call (the input to your sigmoid function is what is generally referred to as the logits, or 'logistic units').
It seems you are not shuffling your dataset as you go. This could be particularly bad given your choice of optimizer, which leads us to...
Stochastic gradient descent is not great. For a boost without adding too much complication, consider using MomentumOptimizer instead. AdamOptimizer is my go-to, but play around with them.
When it comes to writing clean, maintainable code, I'd also encourage you to consider the following:
Use higher level APIs, e.g. tf.layers. It's good you know what's going on at a variable level, but it's easy to make a mistake with all that replicated code, and the default values with the layer implementations are generally pretty good
Consider using the tf.data.Dataset API for your data input. It's a bit scary at first, but it handles a lot of things like batching, shuffling, repeating epochs etc. very nicely
Consider using something like the tf.estimator.Estimator API for handling session runs, summary writing and evaluation.
With all those changes, you might have something that looks like the following (I've left your code in so you can roughly see the equivalent lines).
For graph construction:
def get_logits(features):
    """tf.layers API is cleaner and has better default values."""
    # #layer 1
    # w1 = tf.Variable(tf.truncated_normal([3, 4], stddev=0.1))
    # b1 = tf.Variable(tf.zeros([4]))
    # y1 = tf.matmul(x, w1) + b1
    x = tf.layers.dense(features, 4, activation=tf.nn.relu)
    # #layer 2
    # w2 = tf.Variable(tf.truncated_normal([4, 4], stddev=0.1))
    # b2 = tf.Variable(tf.zeros([4]))
    # y2 = tf.matmul(y1, w2) + b2
    x = tf.layers.dense(x, 4, activation=tf.nn.relu)
    # w3 = tf.Variable(tf.truncated_normal([4, 2], stddev=0.1))
    # b3 = tf.Variable(tf.zeros([2]))
    # y3 = tf.nn.sigmoid(tf.matmul(y2, w3) + b3) #sigmoid
    # N.B. Don't take a non-linearity here.
    logits = tf.layers.dense(x, 1, activation=None)
    # remove unnecessary final dimension, batch_size * 1 -> batch_size
    logits = tf.squeeze(logits, axis=-1)
    return logits
def get_loss(logits, labels):
    """tf.nn.sigmoid_cross_entropy_with_logits is numerically stable."""
    # #cost function
    # cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(a)))
    # reduce to a scalar so it can be minimized and logged directly
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        logits=logits, labels=labels))
def get_train_op(loss):
    """There are better options than standard SGD. Try the following."""
    learning_rate = 1e-3
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    # optimizer = tf.train.AdamOptimizer(learning_rate)
    return optimizer.minimize(loss)
def get_inputs(feature_data, label_data, batch_size, n_epochs=None,
               shuffle=True):
    """
    Get features and labels for training/evaluation.
    Args:
        feature_data: numpy array of feature data.
        label_data: numpy array of label data
        batch_size: size of batch to be returned
        n_epochs: number of epochs to train for. None will result in repeating
            forever/until stopped
        shuffle: bool flag indicating whether or not to shuffle.
    """
    dataset = tf.data.Dataset.from_tensor_slices(
        (feature_data, label_data))
    dataset = dataset.repeat(n_epochs)
    if shuffle:
        dataset = dataset.shuffle(len(feature_data))
    dataset = dataset.batch(batch_size)
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels
For session running you could use this like you have (what I'd call 'the hard way')...
features, labels = get_inputs(
    trainArrayValues, trainArrayLabels, batchSize, n_epochs, shuffle=True)
logits = get_logits(features)
loss = get_loss(logits, labels)
train_op = get_train_op(loss)
init = tf.global_variables_initializer()

# monitored sessions have the `should_stop` method, which works with datasets
with tf.train.MonitoredSession() as sess:
    sess.run(init)
    while not sess.should_stop():
        # get both loss and optimizer step in the same session run
        loss_val, _ = sess.run([loss, train_op])
        print(loss_val)
# save variables etc, do evaluation in another graph with different inputs?
# save variables etc, do evaluation in another graph with different inputs?
but I think you're better off using a tf.estimator.Estimator, though some people prefer tf.keras.Models.
def model_fn(features, labels, mode):
    logits = get_logits(features)
    loss = get_loss(logits, labels)
    train_op = get_train_op(loss)
    predictions = tf.greater(logits, 0)
    accuracy = tf.metrics.accuracy(labels, predictions)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, train_op=train_op,
        eval_metric_ops={'accuracy': accuracy}, predictions=predictions)

def train_input_fn():
    return get_inputs(trainArrayValues, trainArrayLabels, batchSize)

def eval_input_fn():
    return get_inputs(
        testArrayValues, testArrayLabels, batchSize, n_epochs=1, shuffle=False)

# Where variables and summaries will be saved to
model_dir = './model'
estimator = tf.estimator.Estimator(model_fn, model_dir)
estimator.train(train_input_fn, max_steps=max_steps)
estimator.evaluate(eval_input_fn)
Note if you use estimators the variables will be saved after training, so you won't need to re-train each time. If you want to reset, just delete the model_dir.
I see that you are using a softmax-style loss with sigmoidal activation functions in the last layer. Let me explain the difference between softmax and sigmoidal activations.
You are allowing the output of the network to be (0, 1), (1, 0), (0, 0) or (1, 1), because the sigmoidal activations "squish" each element of the output between 0 and 1 independently. Your loss function, however, assumes that the output vector sums to one.
What you need to do here is either use the full sigmoidal (binary) cross-entropy, which penalises both the 0 and 1 targets and looks like this:
-tf.reduce_sum(y*tf.log(a))-tf.reduce_sum((1-y)*tf.log(1-a))
Or, if you want a to sum to one, you need to use a softmax activation in your final layer (to get your a's) instead of sigmoids, which can be implemented like this:
exp_out = tf.exp(y3)
a = exp_out / tf.reduce_sum(exp_out, axis=1, keepdims=True)
Ps. I'm using my phone on a train so please excuse typos
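For what it's worth, both options also exist as built-in, numerically stable ops. A minimal sketch, where logits stands for the last layer's pre-activation output from the question (tf.matmul(y2, w3) + b3, i.e. without the sigmoid):
logits = tf.matmul(y2, w3) + b3  # last layer before any activation

# Sigmoid route: equivalent to -y*log(a) - (1-y)*log(1-a) with a = sigmoid(logits)
sigmoid_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

# Softmax route: a sums to one along the class axis
a = tf.nn.softmax(logits)
softmax_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))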
This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:
import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)
DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output
#W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
#b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
#O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')
W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O, W2) + b2
O2_0 = tf.gather_nd(O2, [[0,0]])
estimate0 = 2.0*O2_0
eval_inp = tf.gather_nd(I,[[0,0]])
k = 1e-5
L = 5.0
distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )
opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, #W1, b1,
W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped_grads_and_vars)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(10000):
        print(sess.run([train_op, I, W, distance], feed_dict={I: 2.0*np.random.rand(1,1) - 1.0}))
    for i in range(10):
        print(sess.run([eval_inp, W, estimate0], feed_dict={I: 2.0*np.random.rand(1,1) - 1.0}))
However, when I uncomment the intermediate hidden layer and train the resulting network, I see that the weights are not evolving anymore:
import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)
DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output
W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')
W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O1, W2) + b2
O2_0 = tf.gather_nd(O2, [[0,0]])
estimate0 = 2.0*O2_0
eval_inp = tf.gather_nd(I,[[0,0]])
distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )
opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, W1, b1,
W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped_grads_and_vars)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(10000):
        print(sess.run([train_op, I, W, distance], feed_dict={I: 2.0*np.random.rand(1,1) - 1.0}))
    for i in range(10):
        print(sess.run([eval_inp, W, estimate0], feed_dict={I: 2.0*np.random.rand(1,1) - 1.0}))
The evaluation of estimate0 converges quickly to some fixed value that becomes independent of the input signal. I have no idea why this is happening.
Question:
Any idea what might be wrong with the second example?
TL;DR: the deeper the neural network becomes, the more you should pay attention to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.
Problem analysis
I've added tensorboard summaries for the variables and gradients into both of your scripts and got the following:
2-layer network
3-layer network
The charts show the distributions of the W:0 variable (the first layer) and how they change from epoch 0 to epoch 1000. Indeed, you can see that the rate of change is much higher in the 2-layer network. But I'd like to draw attention to the gradient distribution, which is much closer to 0 in the 3-layer network (the first variance is around 0.005, the second is around 0.000002, i.e. more than 1000 times smaller). This is the vanishing gradient problem.
Here's the helper code if you're interested:
for g, v in grads_and_vars:
    tf.summary.histogram(v.name, v)
    tf.summary.histogram(v.name + '_grad', g)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
    writer.add_summary(summary, global_step=i)
Solution
All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.
I replaced your normal initialization with:
W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)
There are lots of tutorials on Xavier init, you can take a look at this one, for example.
Note that I set the bias init to be slightly positive to make sure that the ReLU outputs are positive for most of the neurons, at least in the beginning.
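Applied to the variables from the question, that could look like the following sketch (only the first layer is shown; the other layers would be changed in the same way):
W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)

# Same first-layer variables as in the question, just with the new initializers.
W = tf.get_variable('W', shape=[1, DEPTH_1], initializer=W_init, dtype=tf.float32)
b = tf.get_variable('b', shape=[DEPTH_1], initializer=b_init, dtype=tf.float32)
O = tf.nn.relu(tf.matmul(I, W) + b, name='O')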
This changed the picture immediately:
The weights are still not moving quite as fast as before, but they are moving (note the scale of the W:0 values), and the gradient distribution has become much less peaked at 0, which is much better.
Of course, that's not the end of it. To improve things further, you should implement the full autoencoder, because currently the loss is only affected by the reconstruction of the [0,0] element, so most outputs aren't used in the optimization at all. You can also play with different optimizers (Adam would be my choice) and the learning rates.
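A minimal sketch of that "full" loss, assuming the goal is to reconstruct the whole input batch I with O2 rather than only its [0,0] element:
# Use the whole output instead of only the [0, 0] element, so every output
# unit (and therefore every weight) receives a gradient.
estimate = 2.0 * O2                       # keeps the scaling used for estimate0
distance = tf.reduce_mean(tf.square(I - estimate))

opt = tf.train.AdamOptimizer(1e-3)        # Adam instead of plain SGD, as suggested
train_op = opt.minimize(distance)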
That looks very exciting. Where exactly does this code belong? I've only recently discovered TensorBoard.
Is this in callbacks somehow:
for g, v in grads_and_vars:
    tf.summary.histogram(v.name, v)
    tf.summary.histogram(v.name + '_grad', g)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())
Is this after fitting:
_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
writer.add_summary(summary, global_step=i)
I am trying to build a deep network using TF, with Martin Gorner's video as a reference. I had some success with the shallow network example; however, the deep network's accuracy collapses after reaching around 98% for some reason.
The network recognises MNIST numerical characters using five layers. I am training with batches of 100 for 10000 iterations. The accuracy steadily increases until it reaches around 98%, then collapses completely to 9.8%.
Any ideas please?
"""Tensor flow character recognition of Numerals"""
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
# layer K will have 200 neurons and so on
K = 200
L = 100
M = 60
N = 30
# ----- Initialization -----
# the None will become the batch size of 100
# 28 by 28 grayscale images described by a single byte
X = tf.placeholder(tf.float32, [None, 784])
# training will require computing variables W and b
W1 = tf.Variable(tf.truncated_normal([28*28, K], stddev=0.1))
B1 = tf.Variable(tf.zeros([K]))
W2 = tf.Variable(tf.truncated_normal([K, L], stddev=0.1))
B2 = tf.Variable(tf.zeros([L]))
W3 = tf.Variable(tf.truncated_normal([L, M], stddev=0.1))
B3 = tf.Variable(tf.zeros([M]))
W4 = tf.Variable(tf.truncated_normal([M, N], stddev=0.1))
B4 = tf.Variable(tf.zeros([N]))
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))
init = tf.global_variables_initializer()
# ----- Model -----
# the model Y = WX+b
# reshape is used to flatten the image into a 1D array of 784 locations
# -1 is used to tell python to figure out the reshape as there's only one solution
#Y = tf.nn.softmax(tf.matmul(tf.reshape(X, [-1, 784]), W) + b)
Y1 = tf.nn.relu(tf.matmul(X, W1) + B1)
Y2 = tf.nn.relu(tf.matmul(Y1, W2) + B2)
Y3 = tf.nn.relu(tf.matmul(Y2, W3) + B3)
Y4 = tf.nn.relu(tf.matmul(Y3, W4) + B4)
Y5 = tf.nn.softmax(tf.matmul(Y4, W5) + B5)
# placeholder for correct answers
# e.g. correct answer for 2 will be [0 0 1 0 0 0 0 0 0 0 ]
Y_ = tf.placeholder(tf.float32, [None, 10])
# the loss function
cross_entropy = tf.reduce_sum(Y_ * tf.log(Y5)) * -1
# ----- Success Metrics -----
# calculate the % of correct answers found in batch
is_correct = tf.equal(tf.argmax(Y5, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
# ----- Training Step -----
# pick an optimizer and tell it to minimize the cross entropy loss function
optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)
# create the execution session
sess = tf.Session()
sess.run(init)
for i in range(10000):
    # load a batch of images from mnist
    batch_X, batch_Y = mnist.train.next_batch(100)
    train_data = {X: batch_X, Y_: batch_Y}
    # ----- Execution -----
    # train
    sess.run(train_step, feed_dict=train_data)
    # test for success
    a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)
    # this is only to display information
    if i % 100 == 0:
        # check for success on whole data set
        test_data = {X: mnist.test.images, Y_: mnist.test.labels}
        a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
        print(a)
It is the accuracy on the validation set which collapses, right?
If so, you may be dramatically overfitting.
98% is possibly the best you can achieve with a network of that capacity/structure.
Here is a basic TensorFlow network example (based on MNIST), with complete code, that gives roughly 0.92 accuracy:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# or, on older TF versions: tf.initialize_all_variables().run()
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
Question: Why adding an extra layer, like in the code below, makes it so much worse that it drops to about 0.11 accuracy?
W = tf.Variable(tf.zeros([784, 100]))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.zeros([100, 10]))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
The example does not properly initialise the weights, but without a hidden layer it turns out that the effective linear softmax regression the demo performs is unaffected by that choice. Setting them all to zero is safe, but only for a single-layer network.
When you make a deeper network, though, this is a disastrous choice. You must use non-equal initialisation of the neural network weights, and the usual quick way to do this is randomly.
Try this:
W = tf.Variable(tf.random_uniform([784, 100], -0.01, 0.01))
b = tf.Variable(tf.zeros([100]))
h0 = tf.nn.relu(tf.matmul(x, W) + b)
W2 = tf.Variable(tf.random_uniform([100, 10], -0.01, 0.01))
b2 = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(h0, W2) + b2)
The reason you need these non-identical weights is down to how backpropagation works: the values of the weights in a layer determine how that layer calculates gradients. If all the weights are the same, then all the gradients are the same, which in turn means that all weight updates are the same. Everything changes in lockstep, and the behaviour is similar to having a single neuron in the hidden layer (because you have multiple neurons all with identical parameters), which can effectively only choose one class.
Neil explained nicely how to fix your problem; I will add a little explanation of why this happens.
The problem is not only that the gradients are all the same, but also that all of them are 0. This happens because relu(Wx + b) = 0 when W = 0 and b = 0. There is even a name for this: a dead neuron.
The network does not progress at all, and it does not matter whether you train it for 1 step or for 1 million. The results will not be different from a random choice, and you can see that in your accuracy of 0.11 (if you select randomly you will get about 0.10).
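A minimal toy sketch (hypothetical shapes, not the question's exact model) that makes this visible: with zero-initialized weights and biases, the gradient of the loss with respect to the hidden-layer weights is exactly zero, so gradient descent never moves them.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
labels = tf.placeholder(tf.float32, [None, 2])

# Zero-initialized hidden layer: relu(xW + b) is identically 0 (dead neurons).
W = tf.Variable(tf.zeros([4, 3]))
b = tf.Variable(tf.zeros([3]))
h = tf.nn.relu(tf.matmul(x, W) + b)

W2 = tf.Variable(tf.zeros([3, 2]))
b2 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(h, W2) + b2

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
grad_W, = tf.gradients(loss, [W])  # gradient w.r.t. the hidden-layer weights

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    g = sess.run(grad_W, feed_dict={x: np.random.rand(5, 4),
                                    labels: np.eye(2)[np.random.randint(0, 2, 5)]})
    print(g)  # all zeros: the dead ReLUs pass no gradient back to W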