I am currently trying to create a simple ANN learning environment for reinforcement learning. I have already done curve fitting with a neural network to substitute a physical model, and now I would like to build a simple reinforcement learning model out of curiosity.
To create this model, I thought it would be a good option to manipulate the loss function so that it does not calculate the difference between expectation and model output, but instead runs a short simulation for a few rounds and counts the points the model earns for a specific target. In the example code below, the model controls a simple mass-spring-damper system that starts with a random displacement and velocity. The model can exert a force on the mass, and points are awarded based on the distance from the equilibrium position. At the end I invert the points by dividing one by the number of points earned. I am not sure whether this is the right approach, but I wanted to try it anyway for the sake of learning. Now I get the error message "No gradients provided for any variable:" and I am not sure how to solve it.
Here is my code:
import time
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Input, Dense, Conv2D, Reshape, concatenate, Flatten, UpSampling2D, AveragePooling2D, LayerNormalization
import random
#Physical parameters
m = 1      #kg
k = 1      #N/m
c = 0.01   #N*s/m
dt = 0.01  #s

opt = keras.optimizers.Adam(learning_rate=0.01)

def getnewstate(u, v, f):
    #Calculate the new state of the mass-spring-damper system
    a = (f - v*c - k*u)/m
    v = v + a*dt
    u = u + v*dt
    return (u, v)

def generatemodel():
    #Generate a simple Keras model
    kernel_initializer = tf.keras.initializers.RandomNormal(stddev=0.01)
    bias_initializer = tf.keras.initializers.Zeros()
    InputLayer = Input(shape=(2,))
    Outputlayer = Dense(1, activation='linear')(InputLayer)
    model = Model(inputs=InputLayer, outputs=Outputlayer)
    return model

def lossfunction(u, v, model):
    #Custom loss function
    loss = 0
    t = 0
    t_last = 0
    #run for 100 timesteps (to see if it runs at all)
    for j in range(100):
        x = []
        x.append(np.array([u, v]))
        x = np.array(x)
        f = model(x)
        f = f.numpy()[0][0]
        (u, v) = getnewstate(u, v, f)
        points = 1000/(abs(u) + 1)
        loss = loss + 1/points
        t += dt
    return loss
def dotraining(model):
    #training loop
    for epoch in range(100):
        print("\nStart of epoch %d" % (epoch,))
        start_time = time.time()
        loss_value = 0
        # Iterate over the batches of the dataset.
        for step in range(100):
            with tf.GradientTape() as tape:
                loss_value = []
                for i in range(10):
                    #Randomize the starting condition
                    u = random.random() - 0.5
                    v = random.random() - 0.5
                    x = []
                    x.append(np.array([u, v]))
                    x = np.array(x)
                    #feed the model
                    logits = model(x, training=True)
                    #calculate the loss
                    loss_value.append(lossfunction(u, v, model))
                print(step)
                print(loss_value)
                loss = loss_value
                loss = tf.convert_to_tensor(loss)
            grads = tape.gradient(loss, model.trainable_weights)
            opt.apply_gradients(zip(grads, model.trainable_weights))
            # Log every 200 batches.
            if step % 200 == 0:
                print(
                    "Training loss (for one batch) at step %d: %.4f"
                    % (step, float(loss_value))
                )
                print("Seen so far: %d samples" % ((step + 1) * 64))
        print("Time taken: %.2fs" % (time.time() - start_time))
model=generatemodel()
x = []
x.append(np.array([1.0,2.0]))
print(np.shape(x))
f=model(np.array(x))
dotraining(model)
The problem is that when you cast f to numpy here:
f = f.numpy()[0][0]
it stops being a tensor, and TensorFlow doesn't track its gradient any more.
For TensorFlow to compute gradients, you must get from the inputs to the loss using only tensor operations.
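As a rough illustration (my own sketch, not code from the answer), a differentiable version of lossfunction would keep u, v and the running loss as tensors and perform the physics update with TensorFlow operations, reusing the constants m, k, c and dt from the question:

def lossfunction_tensor(u, v, model):
    # Keep the state as tensors so GradientTape can trace the whole rollout.
    u = tf.convert_to_tensor([[u]], dtype=tf.float32)
    v = tf.convert_to_tensor([[v]], dtype=tf.float32)
    loss = tf.constant(0.0)
    for j in range(100):
        x = tf.concat([u, v], axis=1)      # shape (1, 2), still a tensor
        f = model(x, training=True)        # shape (1, 1), no .numpy() conversion
        a = (f - v*c - k*u)/m              # same physics as getnewstate, in tf ops
        v = v + a*dt
        u = u + v*dt
        points = 1000.0/(tf.abs(u) + 1.0)
        loss = loss + 1.0/points
    return tf.reduce_sum(loss)

With a loss built this way, tape.gradient(loss, model.trainable_weights) should return real gradients instead of None, so apply_gradients no longer raises the "No gradients provided for any variable" error.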
I'm trying to implement and train a neural network using the JAX library and its little neural network submodule, "Stax". Since this library doesn't come with an implementation of binary cross entropy, I wrote my own:
def binary_cross_entropy(y_hat, y):
    bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
    return jnp.mean(-bce)
I implemented a simple neural network and trained it on MNIST, and started to get suspicious of some of the results I was getting. So I implemented the same setup in Keras, and I immediately got wildly different results! The same model, trained in the same way on the same data, was getting 90% training accuracy in Keras instead of around 50% in JAX. Eventually I tracked down part of the issue to my naive implementation of cross-entropy, which is supposedly numerically unstable. Following this post and this code I found, I wrote the following new version:
def binary_cross_entropy_stable(y_hat, y):
    y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
    logits = jnp.log(y_hat/(1 - y_hat))
    max_logit = jnp.clip(logits, 0, None)
    bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
    return jnp.mean(bces)
This works a little better. Now my JAX implementation gets up to 80% train accuracy, but that's still a lot less than the 90% Keras gets. What I want to know is what is going on? Why are my two implementations not behaving the same way?
Below, I condensed my two implementations down to a single script. In this script, I implement the same model in JAX and in Keras. I initialize both with the same weights, and train them using full-batch gradient descent for 10 steps on 1000 datapoints from MNIST, the same data for each model. JAX finishes with 80% training accuracy, while Keras finishes with 90%. Specifically, I get this output:
Initial Keras accuracy: 0.4350000023841858
Initial JAX accuracy: 0.435
Final JAX accuracy: 0.792
Final Keras accuracy: 0.9089999794960022
JAX accuracy (Keras weights): 0.909
Keras accuracy (JAX weights): 0.7919999957084656
And actually, when I vary the conditions a little (using different random initial weights or a different training set), sometimes I get back the 50% JAX accuracy and 90% Keras accuracy.
I swap the weights at the end to verify that the weights obtained from training are indeed the issue, not something to do with the actual computation of the network predictions, or the way I calculate accuracy.
The code:
import numpy as np
import jax
from jax import jit, grad
from jax.experimental import stax, optimizers
import jax.numpy as jnp
import keras
import keras.datasets.mnist
def binary_cross_entropy(y_hat, y):
    bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
    return jnp.mean(-bce)

def binary_cross_entropy_stable(y_hat, y):
    y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
    logits = jnp.log(y_hat/(1 - y_hat))
    max_logit = jnp.clip(logits, 0, None)
    bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
    return jnp.mean(bces)

def binary_accuracy(y_hat, y):
    return jnp.mean((y_hat >= 1/2) == (y >= 1/2))
########################################
#                                      #
#            Create dataset            #
#                                      #
########################################
input_dimension = 784

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data(path="mnist.npz")
xs = np.concatenate([x_train, x_test])
xs = xs.reshape((70000, 784))
ys = np.concatenate([y_train, y_test])
ys = (ys >= 5).astype(np.float32)
ys = ys.reshape((70000, 1))
train_xs = xs[:1000]
train_ys = ys[:1000]

########################################
#                                      #
#          Create JAX model            #
#                                      #
########################################
jax_initializer, jax_model = stax.serial(
    stax.Dense(1000),
    stax.Relu,
    stax.Dense(1),
    stax.Sigmoid
)
rng_key = jax.random.PRNGKey(0)
_, initial_jax_weights = jax_initializer(rng_key, (1, input_dimension))

########################################
#                                      #
#         Create Keras model           #
#                                      #
########################################
initial_keras_weights = [*initial_jax_weights[0], *initial_jax_weights[2]]
keras_model = keras.Sequential([
    keras.layers.Dense(1000, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
keras_model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    loss=keras.losses.binary_crossentropy,
    metrics=["accuracy"]
)
keras_model.build(input_shape=(1, input_dimension))
keras_model.set_weights(initial_keras_weights)
if __name__ == "__main__":
    ########################################
    #                                      #
    #       Compare untrained models       #
    #                                      #
    ########################################
    initial_keras_predictions = keras_model.predict(train_xs, verbose=0)
    initial_jax_predictions = jax_model(initial_jax_weights, train_xs)
    _, keras_initial_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
    jax_initial_accuracy = binary_accuracy(jax_model(initial_jax_weights, train_xs), train_ys)
    print("Initial Keras accuracy:", keras_initial_accuracy)
    print("Initial JAX accuracy:", jax_initial_accuracy)

    ########################################
    #                                      #
    #           Train JAX model            #
    #                                      #
    ########################################
    L = jit(binary_cross_entropy_stable)
    gradL = jit(grad(lambda w, x, y: L(jax_model(w, x), y)))
    opt_init, opt_apply, get_params = optimizers.sgd(0.01)
    network_state = opt_init(initial_jax_weights)
    for _ in range(10):
        wT = get_params(network_state)
        gradient = gradL(wT, train_xs, train_ys)
        network_state = opt_apply(
            0,
            gradient,
            network_state
        )
    final_jax_weights = get_params(network_state)
    final_jax_training_predictions = jax_model(final_jax_weights, train_xs)
    final_jax_accuracy = binary_accuracy(final_jax_training_predictions, train_ys)
    print("Final JAX accuracy:", final_jax_accuracy)

    ########################################
    #                                      #
    #          Train Keras model           #
    #                                      #
    ########################################
    for _ in range(10):
        keras_model.fit(
            train_xs,
            train_ys,
            epochs=1,
            batch_size=1000,
            verbose=0
        )
    final_keras_loss, final_keras_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
    print("Final Keras accuracy:", final_keras_accuracy)

    ########################################
    #                                      #
    #             Swap weights             #
    #                                      #
    ########################################
    final_keras_weights = keras_model.get_weights()
    final_keras_weights_in_jax_format = [
        (final_keras_weights[0], final_keras_weights[1]),
        tuple(),
        (final_keras_weights[2], final_keras_weights[3]),
        tuple()
    ]
    jax_accuracy_with_keras_weights = binary_accuracy(
        jax_model(final_keras_weights_in_jax_format, train_xs),
        train_ys
    )
    print("JAX accuracy (Keras weights):", jax_accuracy_with_keras_weights)
    final_jax_weights_in_keras_format = [*final_jax_weights[0], *final_jax_weights[2]]
    keras_model.set_weights(final_jax_weights_in_keras_format)
    _, keras_accuracy_with_jax_weights = keras_model.evaluate(train_xs, train_ys, verbose=0)
    print("Keras accuracy (JAX weights):", keras_accuracy_with_jax_weights)
Try changing the PRNG seed at line 57 (the jax.random.PRNGKey(0) call) to a value other than 0 to run the experiment with different initial weights.
Your binary_cross_entropy_stable function does not match the output of keras.binary_crossentropy; for example:
x = np.random.rand(10)
y = np.random.rand(10)
print(keras.losses.binary_crossentropy(x, y))
# tf.Tensor(0.8134677734043875, shape=(), dtype=float64)
print(binary_cross_entropy_stable(x, y))
# 0.9781515
That is where I would start if you're trying to exactly duplicate the model.
You can view the source of the keras loss function here: keras/losses.py#L1765-L1810, with the main part of the implementation here: keras/backend.py#L4972-L5017
One detail: it appears that with a sigmoid activation function, Keras re-uses some cached logits to compute the binary cross entropy while avoiding problematic values: keras/backend.py#L4988-L4997. I'm not sure how to easily replicate that behavior using JAX & stax.
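One rough way to approximate that behavior with stax (my own untested sketch, not part of this answer) is to drop the final stax.Sigmoid layer so the network outputs logits, compute the loss directly from those logits with the usual stable formulation, and apply the sigmoid only when probabilities are needed:

import jax
import jax.numpy as jnp
from jax.experimental import stax

# Logit-producing variant of the model: same layers, but no final Sigmoid.
# (The weight pytree then has 3 entries instead of 4.)
jax_initializer, jax_logit_model = stax.serial(
    stax.Dense(1000),
    stax.Relu,
    stax.Dense(1)
)

def binary_cross_entropy_from_logits(logits, y):
    # Same formulation tf.nn.sigmoid_cross_entropy_with_logits uses:
    # max(l, 0) - l*y + log(1 + exp(-|l|)); never exponentiates a large positive number.
    return jnp.mean(jnp.clip(logits, 0, None) - logits * y + jnp.log1p(jnp.exp(-jnp.abs(logits))))

def predict_probabilities(weights, x):
    # Sigmoid is applied only for predictions/accuracy, not inside the loss.
    return jax.nn.sigmoid(jax_logit_model(weights, x))

Training against the logits this way removes both the probability clipping and the log(y_hat/(1 - y_hat)) round trip from the gradient path, which is one place the two implementations can drift apart.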
!!!!!!!!!TL;DR at the bottom!!!!!!!!
In an attempt to learn the ins and outs of ML, I have been implementing a neural network optimizer in C++ and wrapped it with SWIG as a Python module. Of course, the first problem I tackled was XOR via the following snippet of code: 2 input nodes, 2 hidden nodes, 1 output node.
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import time
#=======================================================
# Training Set
#=======================================================
X = [[0,1],[1,0],[1,1],[0,0]]
Y = [[1],[1],[0],[0]]
nIn = len(X[0])
nOut = len(Y[0])
#=======================================================
# Model
#=======================================================
verbosity = 0
#Initialize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...]
N = NeuralNetwork([nIn,2,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize the classification optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X,Y)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time();
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10000,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#Make a prediction
print(Opt.predict(X))
This snippet of code yields the following output (Correct answer would be [1,1,0,0])
--- 0.10273098945617676 seconds ---
((0.9398755431175232,), (0.9397522211074829,), (0.0612373948097229,), (0.04882470518350601,))
>>>
Looks great!
Now for the issue. The following snippet of code tries to learn from the MNIST dataset, but suffers very obviously from overfitting: 784 inputs (28x28 pixels), 50 hidden nodes, 10 outputs.
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import matplotlib.pyplot as plt
import numpy as np
import pickle
import time
#=======================================================
# Data Set
#=======================================================
#load the data dictionary
modeldata = pickle.load( open( "mnist_data.p", "rb" ) )
X = modeldata['X']
Y = modeldata['Y']
#normalize data
X = np.array(X)
X = X/255
X = X.tolist()
#training set
X1 = X[0:49999]
Y1 = Y[0:49999]
#validation set
X2 = X[50000:59999]
Y2 = Y[50000:59999]
#number of inputs/outputs
nIn = len(X[0])  #=784 (28x28 pixels)
nOut = len(Y[0]) #=10
#=======================================================
# Model
#=======================================================
verbosity = 1
#Initialize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2,..,nOutputs],['Activation1','Activation2'...]
N = NeuralNetwork([nIn,50,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X1,Y1)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time();
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#================================
#Final Accuracy on training set
#================================
XL = Opt.predict(X1)
correct = 0
for i, x in enumerate(XL):
    if XL[i].index(max(XL[i])) == Y[i].index(max(Y1[i])):
        correct = correct + 1
print("Training set Correct = " + str(correct))
Accuracy = correct/len(XL)*100
print("Accuracy = " + str(Accuracy) + '%')
#================================
#Final Accuracy on validation set
#================================
XL = Opt.predict(X2)
correct = 0
for i, x in enumerate(XL):
    if XL[i].index(max(XL[i])) == Y[i].index(max(Y2[i])):
        correct = correct + 1
print("Testing set Correct = " + str(correct))
Accuracy = correct/len(XL)*100
print("Accuracy = " + str(Accuracy) + '%')
That snippet of code yields the following output which shows the training accuracy and validation accuracy.
-------------------------
Epoch
9
-------------------------
E=
0.00696964
E=
0.350509
E=
3.49568e-05
E=
4.09073e-06
E=
1.38491e-06
E=
0.229873
E=
3.60186e-05
E=
0.000115187
E=
2.29978e-06
E=
2.69165e-06
--- 27.400235176086426 seconds ---
Training set Correct = 48435
Accuracy = 96.87193743874877%
Testing set Correct = 982
Accuracy = 9.820982098209821%
The training set accuracy is great, but then the testing set is no better than a random guess. Any idea what could be causing this?
TL;DR
Solved XOR with a model of 2 inputs, 2 hidden nodes, 1 output and sigmoid activation functions. Good results.
Tried to solve the MNIST dataset with a model of 784 inputs (28x28 pixels), 50 hidden nodes, 10 outputs and sigmoid activation functions. Severe overfitting issue: ~95% accuracy on the training set, ~10% accuracy on the validation set.
Any idea what is causing this?
The cause of overfitting is a combination of the data and the model (the network in this case). During training the network was 'lazy': it found aspects of the data that worked well on the training data but do not generalize.
It is difficult or impossible to point out exactly which nodes/weights in the trained network are responsible for the overfitting.
But we can reduce overfitting with several tricks:
Regularisation
Drop-out (easier to implement)
Change Network Architecture (fewer layers/fewer nodes/more dimension reduction)
https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
To get an idea of regularization, try the playground from tensorflow:
https://playground.tensorflow.org/
A visualisation of dropout
https://yusugomori.com/projects/deep-learning/dropout-relu
Besides trying out regularisation techniques, also experiment with different NN architectures.
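As a quick way to see the effect of these tricks before implementing them in the C++ library, one could mirror the 784-50-10 network in Keras and enable them there. A minimal sketch (my illustration, not the asker's MikeLearn API; X1/Y1/X2/Y2 refer to the arrays from the question):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical Keras stand-in for the 784-50-10 network, with two of the tricks
# above applied: L2 weight regularisation and dropout. The softmax output and
# categorical cross-entropy are the standard Keras choice for one-hot labels,
# swapped in for the original sigmoid outputs.
model = keras.Sequential([
    layers.Dense(50, activation="sigmoid", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),   # randomly silence half the hidden units during training
    layers.Dense(10, activation="softmax")
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(np.array(X1), np.array(Y1), validation_data=(np.array(X2), np.array(Y2)),
#           epochs=10, batch_size=32)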
For a complex text classification task I am training an RNN model. Somehow the weights of the RNN only change at the beginning and then only in very tiny steps (the gradients are on the order of 1e-7):
For those of you who are not familiar with TensorBoard: this shows the distribution of values for the bias and the weights of the RNN (the x axis is the value, the y axis the number of values and z the training iterations).
I constructed a toy example which does not make sense but reproduces the same behaviour:
import numpy as np
import tensorflow as tf

tensorboard_save_path = "../RNN/tensorboard/supersimple/"

x = np.random.normal(size=(33, 20, 5000))
y = np.array([1 if i > 0.5 else 0 for i in np.random.random(33)])

##### NETWORK #########
with tf.name_scope("RNN"):
    rnn_cell = tf.contrib.rnn.BasicRNNCell(1)
    outputs, states = tf.nn.dynamic_rnn(rnn_cell, x, dtype=tf.float64)
    rnn_weights, rnn_biases = rnn_cell.variables
    tf.summary.histogram("RNN weights", rnn_weights)
    tf.summary.histogram("RNN biases", rnn_biases)

pred = tf.sigmoid(outputs[:, -1])

with tf.name_scope("cost"):
    cost = tf.losses.mean_squared_error(predictions=pred, labels=np.reshape(y, (33, 1)))

with tf.name_scope("train"):
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.5).minimize(cost)

init = tf.global_variables_initializer()
merged_summary = tf.summary.merge_all()
writer = tf.summary.FileWriter(tensorboard_save_path)
print("\ttensorboard --logdir=" + tensorboard_save_path)

sess = tf.Session()
sess.run(init)

for i in range(1000):
    sess.run([optimizer])
    if i % 50 == 0:
        c, s = sess.run([cost, merged_summary])
        writer.add_summary(s, i)
        print("cost is %f" % c)
Expected behaviour would be that the model overfits due to the huge amount of variables in contrast to the few training samples. Any idea what's going wrong here?
Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. In order to understand it better, I'm trying to remake the TensorFlow code of this FrozenLake example in Keras.
My code:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))
def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))
model.compile(loss=custom_loss, optimizer='sgd')
# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j += 1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample()  #random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0, action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, it doesn't work well: Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better: Percent of succesful episodes: 0.352%
plt.plot(rList)
What have I done wrong?
Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to only decrease the epsilon value IF you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep on exploring the map randomly until it starts to converge on a good route, and then it's a good idea to reduce the epsilon parameter.
I've actually implemented a similar model in keras in this gist using Convolutional layers instead of Dense layers. Managed to get it to work in under 2000 episodes. Might be of some help to others :)
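A minimal sketch of that epsilon schedule (my own illustration, not from the gist; the action selection below is only a placeholder for the model's argmax from the question's loop):

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
e = 0.5              # start with more exploration than the original 0.1
goals_reached = 0
num_episodes = 2000

for i in range(num_episodes):
    state = env.reset()
    d = False
    reward = 0
    while not d:
        # Placeholder policy: in the real loop this would be np.argmax(model.predict(...)).
        action = env.action_space.sample() if np.random.rand() < e else 0
        state, reward, d, _ = env.step(action)
    if reward > 0:                              # FrozenLake only gives reward 1 when the goal is reached
        goals_reached += 1
        e = 1. / ((goals_reached / 50) + 10)    # decay epsilon only after a success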
I am aiming to do big things with TensorFlow, but I'm trying to start small.
I have small greyscale squares (with a little noise) and I want to classify them according to their colour (e.g. 3 categories: black, grey, white). I wrote a little Python class to generate squares, and 1-hot vectors, and modified their basic MNIST example to feed them in.
But it won't learn anything - e.g. for 3 categories it always guesses ≈33% correct.
import tensorflow as tf
import generate_data.generate_greyscale
data_generator = generate_data.generate_greyscale.GenerateGreyScale(28, 28, 3, 0.05)
ds = data_generator.generate_data(10000)
ds_validation = data_generator.generate_data(500)
xs = ds[0]
ys = ds[1]
num_categories = data_generator.num_categories
x = tf.placeholder("float", [None, 28*28])
W = tf.Variable(tf.zeros([28*28, num_categories]))
b = tf.Variable(tf.zeros([num_categories]))
y = tf.nn.softmax(tf.matmul(x,W) + b)
y_ = tf.placeholder("float", [None,num_categories])
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
# let batch_size = 100 --> therefore there are 100 batches of training data
xs = xs.reshape(100, 100, 28*28) # reshape into 100 minibatches of size 100
ys = ys.reshape((100, 100, num_categories)) # reshape into 100 minibatches of size 100
for i in range(100):
    batch_xs = xs[i]
    batch_ys = ys[i]
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
xs_validation = ds_validation[0]
ys_validation = ds_validation[1]
print(sess.run(accuracy, feed_dict={x: xs_validation, y_: ys_validation}))
My data generator looks like this:
import numpy as np
import random
class GenerateGreyScale():
    def __init__(self, num_rows, num_cols, num_categories, noise):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.num_categories = num_categories
        # set a level of noisiness for the data
        self.noise = noise

    def generate_label(self):
        lab = np.zeros(self.num_categories)
        lab[random.randint(0, self.num_categories-1)] = 1
        return lab

    def generate_datum(self, lab):
        i = np.where(lab == 1)[0][0]
        frac = float(1)/(self.num_categories-1) * i
        arr = np.random.uniform(max(0, frac-self.noise), min(1, frac+self.noise), self.num_rows*self.num_cols)
        return arr

    def generate_data(self, num):
        data_arr = np.zeros((num, self.num_rows*self.num_cols))
        label_arr = np.zeros((num, self.num_categories))
        for i in range(0, num):
            label = self.generate_label()
            datum = self.generate_datum(label)
            data_arr[i] = datum
            label_arr[i] = label
        #data_arr = data_arr.astype(np.float32)
        #label_arr = label_arr.astype(np.float32)
        return data_arr, label_arr
For starters, try initializing your W matrix with random values, not zeros - you're not giving the optimizer anything to work with when the output is all zeros for all inputs.
Instead of:
W = tf.Variable(tf.zeros([28*28, num_categories]))
Try:
W = tf.Variable(tf.truncated_normal([28*28, num_categories],
                                    stddev=0.1))
Your issue is that your gradients are increasing/decreasing without bound, causing the loss function to become NaN.
Take a look at this question: Why does TensorFlow example fail when increasing batch size?
Furthermore, make sure that you run the model for a sufficient number of steps. You are only running it once through your training dataset (100 batches * 100 examples), and this is not enough for it to converge. Increase it to something like 2000 steps at a minimum (running 20 times through your dataset).
Edit (can't comment, so I'll add my thoughts here):
The point of the post I linked is that you can use GradientDescentOptimizer, as long as you make the learning rate something like 0.001. That's the issue: your learning rate was too high for the loss function you were using.
Alternatively, use a different loss function that doesn't increase/decrease the gradients as much: use tf.reduce_mean instead of tf.reduce_sum in the definition of cross_entropy.
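Concretely, reusing the variable names from the question's script (x, y, y_, xs, ys, sess), those suggestions amount to something like the following sketch (my restatement of the advice, not code from the answer):

# Mean instead of sum keeps the loss (and its gradients) at a batch-size-independent scale.
cross_entropy = -tf.reduce_mean(y_ * tf.log(y))

# If you stay with plain gradient descent, use a much smaller learning rate...
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)

# ...and run through the 100 minibatches several times instead of once.
for epoch in range(20):
    for i in range(100):
        sess.run(train_step, feed_dict={x: xs[i], y_: ys[i]})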
While dga and syncd's responses were helpful, I tried using non-zero weight initialization and larger datasets but to no avail. The thing that finally worked was using a different optimization algorithm.
I replaced:
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
with
train_step = tf.train.AdamOptimizer(0.0005).minimize(cross_entropy)
I also embedded the training for loop in another for loop to train for several epochs, resulting in convergence like this:
===# EPOCH 0 #===
Error: 0.370000004768
===# EPOCH 1 #===
Error: 0.333999991417
===# EPOCH 2 #===
Error: 0.282000005245
===# EPOCH 3 #===
Error: 0.222000002861
===# EPOCH 4 #===
Error: 0.152000010014
===# EPOCH 5 #===
Error: 0.111999988556
===# EPOCH 6 #===
Error: 0.0680000185966
===# EPOCH 7 #===
Error: 0.0239999890327
===# EPOCH 8 #===
Error: 0.00999999046326
===# EPOCH 9 #===
Error: 0.00400000810623
EDIT - WHY IT WORKS: I suppose the problem was that I didn't manually choose a good learning rate schedule, and Adam was able to generate a better one automatically.
Found this question when I was having a similar issue. I fixed mine by scaling the features.
A little background: I was following the TensorFlow tutorial, but I wanted to use the data from Kaggle (see data here) to do the modeling. In the beginning I kept having the same issue: the model just doesn't learn. After rounds of troubleshooting, I realized that the Kaggle data was on a completely different scale, so I scaled the data to the same (0, 1) range as TensorFlow's MNIST dataset.
Just figured I would add my two cents here, in case some beginners trying to follow the tutorial's settings get stuck like I did =)
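For example, a minimal sketch of that scaling step (assuming pixel-style features in the 0-255 range, like the Kaggle digit data; the array names are hypothetical):

import numpy as np

# Hypothetical raw feature matrix with pixel values in [0, 255].
X_raw = np.random.randint(0, 256, size=(1000, 784)).astype(np.float32)

# Scale to [0, 1], the same range as the MNIST data used in the tutorial.
X_scaled = X_raw / 255.0

# More general min-max scaling, for features whose range isn't known in advance.
X_minmax = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0) + 1e-8)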