I have a simple single-layer perceptron structure in TensorFlow that I learned from a video by Siraj Raval. I am trying to extend it to a larger number of layers, and I am having difficulty.
The first example is 2 inputs and 2 outputs, where weights and biases are applied once and then the softmax function is applied to the output.
The second example is 2 inputs and 2 outputs with a hidden layer (2 units) in between, so there are two sets of weights and biases and the softmax function is applied after each of them.
I'm trying to extend the simple case to an N-hidden-layer case, but with limited success: when I add extra layers, they seem to be ignored by the optimizer.
Input is of the form:
inputX = np.array([[ 2.10400000e+03, 3.00000000e+00],
[ 1.60000000e+03, 3.00000000e+00],
[ 2.40000000e+03, 3.00000000e+00],
[ 1.41600000e+03, 2.00000000e+00],
[ 3.00000000e+03, 4.00000000e+00],
[ 1.98500000e+03, 4.00000000e+00],
[ 1.53400000e+03, 3.00000000e+00],
[ 1.42700000e+03, 3.00000000e+00],
[ 1.38000000e+03, 3.00000000e+00],
[ 1.49400000e+03, 3.00000000e+00]])
And output labels are of the form:
inputY = np.array([[1, 0],
[1, 0],
[1, 0],
[0, 1],
[0, 1],
[1, 0],
[0, 1],
[1, 0],
[1, 0],
[1, 0]])
A snippet of my code which executes correctly (dependencies are numpy and tensorflow):
#input and output placeholder, feed data to x, feed labels to y_
x = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None, 2])
#first layer weights and biases
W = tf.Variable(tf.zeros([2,2]))
b = tf.Variable(tf.zeros([2]))
# vector form of x*W + b
y_values = tf.add(tf.matmul(x, W), b)
#activation function
y = tf.nn.softmax(y_values)
cost = tf.reduce_sum(tf.pow(y_ - y, 2))/(n_samples) #sum of squared errors
optimizer = tf.train.AdamOptimizer(alpha).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(training_epochs):
    sess.run(optimizer, feed_dict = {x: inputX, y_:inputY})
    #log training
    if i % display_step == 0:
        cc = sess.run(cost, feed_dict = {x: inputX, y_:inputY})
        print("Training step:", '%04d' % (i), "cost=", "{:.9f}".format(cc))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict = {x: inputX, y_: inputY})
print("Training cost = ", training_cost, "\nW=", sess.run(W), "\nb=", sess.run(b))
#check what it thinks when you give it the input data
print(sess.run(y, feed_dict = {x:inputX}))
I get the output of:
W= [[ 0.00021142 -0.00021142]
[ 0.00120122 -0.00120122]]
b= [ 0.00103542 -0.00103542]
label_predictions = [[ 0.71073025 0.28926972]
[ 0.66503692 0.33496314]
[ 0.73576927 0.2642307 ]
[ 0.64694035 0.35305965]
[ 0.78248388 0.21751612]
[ 0.70078063 0.2992194 ]
[ 0.65879178 0.34120819]
[ 0.6485498 0.3514502 ]
[ 0.64400673 0.3559933 ]
[ 0.65497971 0.34502029]]
Not great, so I wanted to try to increase the number of layers to see if it would improve things.
I added an extra layer by using new variables of W2, b2 and hidden_layer:
#input and output placeholder, feed data to x, feed labels to y_
x = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None, 2])
#first layer weights and biases
W = tf.Variable(tf.zeros([2,2]))
b = tf.Variable(tf.zeros([2]))
#second layer weights and biases
W2 = tf.Variable(tf.zeros([2,2]))
b2 = tf.Variable(tf.zeros([2]))
#flow through first layer
hidden_layer = tf.add(tf.matmul(x, W), b)
hidden_layer = tf.nn.softmax(hidden_layer)
#flow through second layer
y_values = tf.add(tf.matmul(hidden_layer, W2), b2)
y = tf.nn.softmax(y_values)
cost = tf.reduce_sum(tf.pow(y_ - y, 2))/(n_samples) #sum of squared errors
optimizer = tf.train.AdamOptimizer(alpha).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(training_epochs):
    sess.run(optimizer, feed_dict = {x: inputX, y_:inputY})
    #log training
    if i % display_step == 0:
        cc = sess.run(cost, feed_dict = {x: inputX, y_:inputY})
        print("Training step:", '%04d' % (i), "cost=", "{:.9f}".format(cc))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict = {x: inputX, y_: inputY})
print("Training cost = ", training_cost, "\nW=", sess.run(W), "\nW2=", sess.run(W2),\
"\nb=", sess.run(b), "\nb2=", sess.run(b2))
#check what it thinks when you give it the input data
print(sess.run(y, feed_dict = {x:inputX}))
The output then shows that my first-layer weights and biases are still all zeros, and the predictions are now roughly half and half on every training example, much worse than before.
output:
W= [[ 0. 0.]
[ 0. 0.]]
W2= [[ 0.00199614 -0.00199614]
[ 0.00199614 -0.00199614]]
b= [ 0. 0.]
b2= [ 0.00199614 -0.00199614]
label_predictions = [[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]
[ 0.5019961 0.49800384]]
Why is only one layer of weights and biases being affected? Why isn't adding a layer improving the model?
I have a few suggestions to improve the performance of your model:
1.) Randomly initialized variables often work better than zeros, at least for the matrix elements. You could try normally distributed variables. This also explains what you observed: with every weight initialized to zero, the gradient reaching the first layer is multiplied by the (zero) second-layer weights and, by symmetry, stays zero afterwards, so W and b never move away from zero.
2.) You should normalize your input data, since the two columns differ by orders of magnitude. In principle this should not be a problem, since the weights can be adjusted differently, but with random initialization it is probable that the network will pay attention only to the first column. If you normalize the data, both columns will be of the same order of magnitude.
3.) Maybe you should increase the number of neurons in the hidden layer to a value of about 10.
With these modifications, it worked quite well for me. I've posted a complete working example below:
import tensorflow as tf
import numpy as np
alpha = 0.02
training_epochs = 20000
display_step = 2000
inputX = np.array([[ 2.10400000e+03, 3.00000000e+00],
[ 1.60000000e+03, 3.00000000e+00],
[ 2.40000000e+03, 3.00000000e+00],
[ 1.41600000e+03, 2.00000000e+00],
[ 3.00000000e+03, 4.00000000e+00],
[ 1.98500000e+03, 4.00000000e+00],
[ 1.53400000e+03, 3.00000000e+00],
[ 1.42700000e+03, 3.00000000e+00],
[ 1.38000000e+03, 3.00000000e+00],
[ 1.49400000e+03, 3.00000000e+00]])
n_samples = inputX.shape[0]
# Normalize input data
means = np.mean(inputX, axis=0)
stddevs = np.std(inputX, axis=0)
inputX[:,0] = (inputX[:,0] - means[0]) / stddevs[0]
inputX[:,1] = (inputX[:,1] - means[1]) / stddevs[1]
# Define target labels
inputY = np.array([[1, 0],
[1, 0],
[1, 0],
[0, 1],
[0, 1],
[1, 0],
[0, 1],
[1, 0],
[1, 0],
[1, 0]])
#input and output placeholder, feed data to x, feed labels to y_
x = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None, 2])
#first layer weights and biases
W = tf.Variable(tf.random_normal([2,10], stddev=1.0/tf.sqrt(2.0)))
b = tf.Variable(tf.zeros([10]))
#second layer weights and biases
W2 = tf.Variable(tf.random_normal([10,2], stddev=1.0/tf.sqrt(2.0)))
b2 = tf.Variable(tf.zeros([2]))
#flow through first layer
hidden_layer = tf.add(tf.matmul(x, W), b)
hidden_layer = tf.nn.softmax(hidden_layer)
#flow through second layer
y_values = tf.add(tf.matmul(hidden_layer, W2), b2)
y = tf.nn.softmax(y_values)
cost = tf.reduce_sum(tf.pow(y_ - y, 2))/(n_samples) #sum of squared errors
optimizer = tf.train.AdamOptimizer(alpha).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(training_epochs):
    sess.run(optimizer, feed_dict = {x: inputX, y_:inputY})
    #log training
    if i % display_step == 0:
        cc = sess.run(cost, feed_dict = {x: inputX, y_:inputY})
        #check what it thinks when you give it the input data
        print(sess.run(y, feed_dict = {x:inputX}))
        print("Training step:", '%04d' % (i), "cost=", "{:.9f}".format(cc))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict = {x: inputX, y_: inputY})
print("Training cost = ", training_cost, "\nW=", sess.run(W), "\nW2=", sess.run(W2),\
"\nb=", sess.run(b), "\nb2=", sess.run(b2))
The output looks very much like the labels:
[[ 1.00000000e+00 2.48446125e-10]
[ 9.99883890e-01 1.16143732e-04]
[ 1.00000000e+00 2.48440435e-10]
[ 1.65703295e-05 9.99983430e-01]
[ 6.65045518e-05 9.99933481e-01]
[ 9.99985337e-01 1.46147468e-05]
[ 1.69444829e-04 9.99830484e-01]
[ 1.00000000e+00 6.85981003e-12]
[ 1.00000000e+00 2.05180339e-12]
[ 9.99865890e-01 1.34040893e-04]]
Related
I want to classify inputs: if the input value is under 200 the output should be (0, 1), and if it is over 200 the output should be (1, 0).
The input values are sequential integers and the network has 5 layers.
The hidden layers use sigmoid and the last layer uses the softmax function.
The loss is the reduce_mean of a cross entropy, and training uses gradient descent.
import numpy as np
import tensorflow as tf
def set_x_data():
    x_data = np.array([[50], [60], [70], [80], [90], [110], [120], [130],
                       [140], [150], [160], [170], [180], [190], [200], [210],
                       [220], [230], [240], [250], [260], [270], [280], [290],
                       [300], [310], [320], [330], [340], [350], [360], [370],
                       [380], [390]])
    return x_data
def set_y_data(x):
    # the first 16 entries are labelled [0, 1], the remaining 18 are [1, 0]
    y_data = np.array([[0, 1]] * 16 + [[1, 0]] * 18)
    return y_data
def set_bias(efficiency):
    arr = np.array([efficiency])
    return arr
W1 = tf.Variable(tf.random_normal([1, 5]), name='weight1')
W2 = tf.Variable(tf.random_normal([5, 5]), name='weight2')
W3 = tf.Variable(tf.random_normal([5, 5]), name='weight3')
W4 = tf.Variable(tf.random_normal([5, 5]), name='weight4')
W5 = tf.Variable(tf.random_normal([5, 2]), name='weight5')
def inference(input, b):
    hidden_layer1 = tf.sigmoid(tf.matmul(input, W1) + b)
    hidden_layer2 = tf.sigmoid(tf.matmul(hidden_layer1, W2) + b)
    hidden_layer3 = tf.sigmoid(tf.matmul(hidden_layer2, W3) + b)
    hidden_layer4 = tf.sigmoid(tf.matmul(hidden_layer3, W4) + b)
    out_layer = tf.nn.softmax(tf.matmul(hidden_layer4, W5) + b)
    return out_layer
def loss(hypothesis, y):
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(hypothesis), reduction_indices=[1]))
    return cross_entropy
def train(loss):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    train = optimizer.minimize(loss)
    return train
x_data = set_x_data()
y_data = set_y_data(0)
b_data = set_bias(0.8)
x= tf.placeholder(tf.float32, shape=[None, 1])
y= tf.placeholder(tf.float32, shape=[None, 2])
b = tf.placeholder(tf.float32, shape=[None])
hypothesis = inference(x, b)
loss = loss(hypothesis, y)
train = train(loss)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(W1))
for step in range(2000):
    sess.run(train, feed_dict={x:x_data, y:y_data, b:b_data})
print(sess.run(W1))
print(sess.run(hypothesis, feed_dict={x:np.array([[1000]]), b:b_data}))
When I print W1 before and after training, the values barely change, and testing with input = 1000 does not give the result I expect: I would expect something close to (1, 0), but the result is almost (0.5, 0.5).
I guess the mistake comes from the loss function, because it was copied together from here and there, but I can't be sure about it.
The code above is a simplified version of my code, but I think I have to show my real code; it is too long, so I created a new post for it:
classifying data by tensorflow but accuracy value didn't change
There are a few issues in the training of the above network, but with a few changes you can achieve a network that gets this decision function
(The plot in the link shows the score of class 2, i.e. if x > 200)
The list of issues subject to improvement in this network:
The training data is very scarce (only 34 points!) This is typically too small, especially for a 5-layer network as in your case. You typically want many more input samples than parameters in the network. Try adding more input values and reducing the number of layers (as in the code below - I've used floats instead of integers to get more points, but I think it is still compatible).
The input ranges typically require scaling (below I've tried a super-simple scaling by dividing by a constant). This is because you typically want to avoid high ranges of variables (especially if you pass through many layers with a softmax non-linearity, this would destroy the information contained in the very high or very low values). In more advanced cases you might want to do Min-Max scaling or z-scores.
Try more epochs (and try plotting the evolution of the loss function value). With the given number of epochs, the optimization of the loss function had not converged; below I run 10x more epochs, and the loss then almost converges, whereas 2000 epochs were not enough.
Something that helped was shuffling the (x,y) data. Though this is not crucial in this case, it converges faster (see the paper "Efficient Backprop" by Le Cun). And in more serious examples it is typically needed.
Importantly, I think you want b to be a parameter, not a constant, don't you? The bias of a network is typically also optimized together with the multiplicative weights. (Also, it is not common to use a single, shared bias for all the hidden layers.)
Below is the code. Note there might be further improvements but these few tricks end up with the desired decision function.
I've added some inline comments to indicate changes with respect to the original. I hope you find these pieces of advice insightful!
The code:
import numpy as np
import tensorflow as tf
# I've modified the functions set_x_data and set_y_data
# so as to generate a larger set of numbers.
# Generate a range of numbers from 50 to 390
def set_x_data():
    x_data = np.arange(50, 390, 0.1)
    return x_data[:,None]
# Assign labels depending on x_data
def set_y_data(x_data):
    ydata1 = x_data >= 200
    ydata2 = x_data < 200
    return np.hstack((ydata1, ydata2))
def set_bias(efficiency):
    arr = np.array([efficiency])
    return arr
# Let's keep W1 and W5 (one hidden layer only)
# BTW, in this problem you could do with 0 hidden layers. But keeping
# 1 to show it works
W1 = tf.Variable(tf.random_normal([1, 5]), name='weight1')
W5 = tf.Variable(tf.random_normal([5, 2]), name='weight5')
# BTW, b should be a parameter, too.
b = tf.Variable(tf.constant(0.0))
# Just keeping 1 hidden layer
def inference(input):
    hidden_layer1 = tf.sigmoid(tf.matmul(input, W1) + b)
    out_layer = tf.nn.softmax(tf.matmul(hidden_layer1, W5) + b)
    return out_layer
# This is unchanged
def loss(hypothesis, y):
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(hypothesis), reduction_indices=[1]))
    return cross_entropy
# This is unchanged
def train(loss):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    train = optimizer.minimize(loss)
    return train
# Using SCALE to normalize the input variables (range of inputs too big)
# This is a simple normalization in this case. Other examples are
# Min-Max normalization or z-scores.
SCALE = 1000
x_data = set_x_data()
y_data = set_y_data(x_data)
x_data /= SCALE
# Now only placeholders are x and y (b is a parameter)
x= tf.placeholder(tf.float32, shape=[None, 1])
y= tf.placeholder(tf.float32, shape=[None, 2])
hypothesis = inference(x)
loss = loss(hypothesis, y)
train = train(loss)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(W1))
# Epochs x 10, it did not converge with fewer epochs
epochs = 20000
losses = np.zeros(epochs)
for step in range(epochs):
    # Shuffle data
    r = np.random.permutation(x_data.shape[0])
    x_data = x_data[r]
    y_data = y_data[r,:]
    # Small modification here to capture the loss.
    _, l = sess.run([train, loss], feed_dict={x:x_data, y:y_data})
    losses[step] = l
print(sess.run(W1))
print(sess.run(b))
The code to display the decision function above:
%matplotlib inline
import matplotlib.pyplot as plt
ystar = np.arange(50, 400, 10)[:,None]
plt.plot(ystar, sess.run(hypothesis, feed_dict={x:ystar/SCALE})[:,0])
I've recently installed tensorflow on my computer, but I'm confused about some results I'm getting from the first tutorial program. It's a very simple linear regression model that finds W and b for W*x + b = y:
import tensorflow as tf
# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
result:
W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11
It works!
But then I changed the training data from:
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
to:
x_train = [145, 146, 147, 148]
y_train = [151, 152, 153, 154]
I should theoretically get W: [~1] b: [~6] loss: ~0, but instead I get:
W: [ nan] b: [ nan] loss: nan
Below is a print of i, W, b, and loss after 10 iterations of training
[0, '0.3', '-0.3', '4.74e+04']
[1, '1276', '8.408', '1.396e+11']
[2, '-2.188e+06', '-1.494e+04', '4.111e+17']
[3, '3.755e+09', '2.563e+07', '1.211e+24']
[4, '-6.445e+12', '-4.399e+10', '3.566e+30']
[5, '1.106e+16', '7.549e+13', '1.05e+37']
[6, '-1.898e+19', '-1.296e+17', 'inf']
[7, '3.257e+22', '2.223e+20', 'inf']
[8, '-5.59e+25', '-3.816e+23', 'inf']
[9, '9.594e+28', '6.548e+26', 'inf']
Does anyone know what could be causing this? I'm using Tensorflow 1.4.0 (CPU only) with Python 3.5.2 on Ubuntu 16.04
EDIT: normalizing the data helped, thanks!
Turn down your learning rate, or
normalize your training data, e.g. (x - mean) / std.
I recommend the second; give it a try.
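For example, here is a minimal sketch of the second suggestion applied to the snippet above (the NumPy normalization is my own illustration, not part of the original answer):
import numpy as np
import tensorflow as tf
x_train = np.array([145., 146., 147., 148.])
y_train = np.array([151., 152., 153., 154.])
# z-score normalization keeps the inputs small, so the squared-error gradients no longer blow up
x_norm = (x_train - x_train.mean()) / x_train.std()
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
loss = tf.reduce_sum(tf.square(W*x + b - y))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(1000):
    sess.run(train, {x: x_norm, y: y_train})
print(sess.run([W, b, loss], {x: x_norm, y: y_train}))  # loss is now finite and small
Note that W and b are then expressed in terms of the normalized x, so to predict on raw inputs you would apply the same mean and std first.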
Given the DNN (simple case of multilayered perceptron) with 2 hidden layers of 5 and 3 dimensions respectively, I am training a model to recognize the OR gate.
Using tensorflow learn, it seems like it's giving me the reverse output and I have no idea why:
from tensorflow.contrib import learn
classifier = learn.DNNClassifier(hidden_units=[5, 3], n_classes=2)
or_input = np.array([[0.,0.], [0.,1.], [1.,0.]])
or_output = np.array([[0,1,1]]).T
classifier.fit(or_input, or_output, steps=0.05, batch_size=3)
classifier.predict(np.array([ [1., 1.], [1., 0.] , [0., 0.] , [0., 1.]]))
[out]:
array([0, 0, 1, 0])
If I'm doing it "old-school", without the tensorflow.learn as follows, I get the expected answer.
import tensorflow as tf
# Parameters
learning_rate = 1.0
num_epochs = 1000
# Network Parameters
input_dim = 2 # Input dimensions.
hidden_dim_1 = 5 # 1st layer number of features
hidden_dim_2 = 3 # 2nd layer number of features
output_dim = 1 # Output dimensions.
# tf Graph input
x = tf.placeholder("float", [None, input_dim])
y = tf.placeholder("float", [hidden_dim_2, output_dim])
# With biases.
weights = {
    'syn0': tf.Variable(tf.random_normal([input_dim, hidden_dim_1])),
    'syn1': tf.Variable(tf.random_normal([hidden_dim_1, hidden_dim_2])),
    'syn2': tf.Variable(tf.random_normal([hidden_dim_2, output_dim]))
}
biases = {
    'b0': tf.Variable(tf.random_normal([hidden_dim_1])),
    'b1': tf.Variable(tf.random_normal([hidden_dim_2])),
    'b2': tf.Variable(tf.random_normal([output_dim]))
}
# Create a model
def multilayer_perceptron(X, weights, biases):
    # Hidden layer 1 + sigmoid activation function
    layer_1 = tf.add(tf.matmul(X, weights['syn0']), biases['b0'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Hidden layer 2 + sigmoid activation function
    layer_2 = tf.add(tf.matmul(layer_1, weights['syn1']), biases['b1'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Output layer
    out_layer = tf.matmul(layer_2, weights['syn2']) + biases['b2']
    out_layer = tf.nn.sigmoid(out_layer)
    return out_layer
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Define loss and optimizer
cost = tf.sub(y, pred)
# Or you can use fancy cost like:
##tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.initialize_all_variables()
or_input = np.array([[0.,0.], [0.,1.], [1.,0.]])
or_output = np.array([[0.,1.,1.]]).T
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(num_epochs):
        batch_x, batch_y = or_input, or_output # Loop over all data points.
        # Run optimization op (backprop) and cost op (to get loss value)
        _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
        #print (c)
    # Now let's test it on the unknown dataset.
    new_inputs = np.array([[1.,1.], [1.,0.]])
    feed_dict = {x: new_inputs}
    predictions = sess.run(pred, feed_dict)
    print (predictions)
[out]:
[[ 0.99998868]
[ 0.99998868]]
Why is it that I am getting the reversed output using tensorflow.learn? Am I doing something wrongly using the tensorflow.learn?
How do I get the tensorflow.learn code to produce the same output as the "old-school" tensorflow framework?
If you specify the right argument for steps you get the good results:
classifier.fit(or_input, or_output, steps=1000, batch_size=3)
Result:
array([1, 1, 0, 1])
How does steps work
The steps argument specifies the number of times you run the training operation. Let me give you some examples:
with batch_size = 16 and steps = 10, you will see a total of 160 examples
in your example, batch_size = 3 and steps = 1000, the algorithm will see 3000 examples. In fact, it will see 1000 times the same 3 examples you provided
So, steps is not the number of epochs, it is the number of times you run the training op, or the number of times you see a new batch.
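If you want the equivalent of a given number of passes over your data, a rough conversion (my own illustration; tf.learn has no such helper built in) is:
num_samples = len(or_input)   # 3 training examples
batch_size = 3
num_epochs = 1000
# one step = one batch; with 3 samples and batch_size = 3, one step is one full pass
steps = num_epochs * max(num_samples // batch_size, 1)   # = 1000
classifier.fit(or_input, or_output, steps=steps, batch_size=batch_size)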
Why is steps = 0.05 allowed?
In the tf.learn code, they don't check whether steps is an integer. They just run a while loop that checks:
last_step < max_steps
So if max_steps = 0.05, it will behave the same as if max_steps = 1 (last_step is incremented in the loop).
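A rough sketch of that loop logic (not the actual tf.learn source, just an illustration of the behaviour described above):
last_step = 0
max_steps = 0.05
while last_step < max_steps:      # 0 < 0.05 is True, 1 < 0.05 is False
    run_one_training_step()       # hypothetical stand-in for the training op
    last_step += 1
# the body runs exactly once, just as it would with max_steps = 1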
I'm trying to implement an asynchronous parameter server, DistBelief style using TensorFlow. I found that minimize() is split into two functions, compute_gradients and apply_gradients, so my plan is to insert a network boundary between them. I have a question about how to evaluate all the gradients simultaneously and pull them out all at once. I understand that eval only evaluates the subgraph necessary, but it also only returns one tensor, not the chain of tensors required to compute that tensor.
How can I do this more efficiently? I took the Deep MNIST example as a starting point:
import tensorflow as tf
import download_mnist
def weight_variable(shape, name):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial, name=name)
def bias_variable(shape, name):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial, name=name)
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
mnist = download_mnist.read_data_sets('MNIST_data', one_hot=True)
session = tf.InteractiveSession()
x = tf.placeholder("float", shape=[None, 784], name='x')
x_image = tf.reshape(x, [-1,28,28,1], name='reshape')
y_ = tf.placeholder("float", shape=[None, 10], name='y_')
W_conv1 = weight_variable([5, 5, 1, 32], 'W_conv1')
b_conv1 = bias_variable([32], 'b_conv1')
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([5, 5, 32, 64], 'W_conv2')
b_conv2 = bias_variable([64], 'b_conv2')
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
W_fc1 = weight_variable([7 * 7 * 64, 1024], 'W_fc1')
b_fc1 = bias_variable([1024], 'b_fc1')
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder("float", name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
W_fc2 = weight_variable([1024, 10], 'W_fc2')
b_fc2 = bias_variable([10], 'b_fc2')
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
loss = -tf.reduce_sum(y_ * tf.log(y_conv))
optimizer = tf.train.AdamOptimizer(1e-4)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
compute_gradients = optimizer.compute_gradients(loss)
session.run(tf.initialize_all_variables())
batch = mnist.train.next_batch(50)
feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5}
gradients = []
for grad_var in compute_gradients:
    grad = grad_var[0].eval(feed_dict=feed_dict)
    var = grad_var[1]
    gradients.append((grad, var))
I think this last for loop is actually recalculating the last gradient several times, whereas the first gradient is computed only once. How can I grab all the gradients without recomputing them?
Here is a simple example. Understand it, then try it on your specific task.
Initialize required symbols.
x = tf.Variable(0.5)
y = x*x
opt = tf.train.AdagradOptimizer(0.1)
grads = opt.compute_gradients(y)
grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape()), grad[1]) for grad in grads]
apply_placeholder_op = opt.apply_gradients(grad_placeholder)
transform_grads = [(function1(grad[0]), grad[1]) for grad in grads]
apply_transform_op = opt.apply_gradients(transform_grads)
Initialize
sess = tf.Session()
sess.run(tf.initialize_all_variables())
Get all gradients
grad_vals = sess.run([grad[0] for grad in grads])
Apply gradients
feed_dict = {}
for i in xrange(len(grad_placeholder)):
    feed_dict[grad_placeholder[i][0]] = function2(grad_vals[i])
sess.run(apply_placeholder_op, feed_dict=feed_dict)
sess.run(apply_transform_op)
Note: I haven't tested the code myself, but it should be valid apart from minor errors.
Note: function1 and function2 stand for some computation on the gradient, such as 2*x, x^e or e^x and so on.
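For instance, minimal stand-in definitions could look like this (my own illustration; any element-wise transform of the gradient would do):
# hypothetical stand-ins for function1/function2 used above
def function1(g):
    return 2.0 * g   # applied to the symbolic gradient tensors before apply_gradients
def function2(g):
    return 0.5 * g   # applied to the evaluated gradient values fed into the placeholders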
Refer: TensorFlow apply_gradients remotely
I coded up a very simple, runnable example with comments (inspired by the above answer) to see gradient descent in action:
import tensorflow as tf
#function to transform gradients
def T(g, decay=1.0):
    #return decayed gradient
    return decay*g
# x variable
x = tf.Variable(10.0,name='x')
# b placeholder (simualtes the "data" part of the training)
b = tf.placeholder(tf.float32)
# make model (1/2)(x-b)^2
xx_b = 0.5*tf.pow(x-b,2)
y=xx_b
learning_rate = 1.0
opt = tf.train.GradientDescentOptimizer(learning_rate)
# gradient variable list = [ (gradient,variable) ]
gv = opt.compute_gradients(y,[x])
# transformed gradient variable list = [ (T(gradient),variable) ]
decay = 0.1 # decay the gradient for the sake of the example
tgv = [(T(g,decay=decay),v) for (g,v) in gv] #list [(grad,var)]
# apply transformed gradients (this case no transform)
apply_transform_op = opt.apply_gradients(tgv)
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    epochs = 10
    for i in range(epochs):
        b_val = 1.0 #fake data (in SGD it would be different on every epoch)
        print '----'
        x_before_update = x.eval()
        print 'before update',x_before_update
        # compute gradients
        grad_vals = sess.run([g for (g,v) in gv], feed_dict={b: b_val})
        print 'grad_vals: ',grad_vals
        # applies the gradients
        result = sess.run(apply_transform_op, feed_dict={b: b_val})
        print 'value of x should be: ', x_before_update - T(grad_vals[0], decay=decay)
        x_after_update = x.eval()
        print 'after update', x_after_update
You can observe the change in the variable as it is trained, as well as the value of the gradient. Note that the only reason T decays the gradient is that otherwise the update would reach the global minimum in one step: with learning rate 1, x - 1*(x - b) = b, which is the exact minimizer.
As an extra bonus, if you want to see it work with tensorboard, here you go! :)
## run cmd to collect model: python quadratic_minimizer.py --logdir=/tmp/quaratic_temp
## show board on browser run cmd: tensorboard --logdir=/tmp/quaratic_temp
## browser: http://localhost:6006/
import tensorflow as tf
#function to transform gradients
def T(g, decay=1.0):
    #return decayed gradient
    return decay*g
# x variable
x = tf.Variable(10.0,name='x')
# b placeholder (simualtes the "data" part of the training)
b = tf.placeholder(tf.float32)
# make model (1/2)(x-b)^2
xx_b = 0.5*tf.pow(x-b,2)
y=xx_b
learning_rate = 1.0
opt = tf.train.GradientDescentOptimizer(learning_rate)
# gradient variable list = [ (gradient,variable) ]
gv = opt.compute_gradients(y,[x])
# transformed gradient variable list = [ (T(gradient),variable) ]
decay = 0.9 # decay the gradient for the sake of the example
tgv = [ (T(g,decay=decay), v) for (g,v) in gv] #list [(grad,var)]
# apply transformed gradients (this case no transform)
apply_transform_op = opt.apply_gradients(tgv)
(dydx,_) = tgv[0]
x_scalar_summary = tf.scalar_summary("x", x)
grad_scalar_summary = tf.scalar_summary("dydx", dydx)
with tf.Session() as sess:
    merged = tf.merge_all_summaries()
    tensorboard_data_dump = '/tmp/quaratic_temp'
    writer = tf.train.SummaryWriter(tensorboard_data_dump, sess.graph)
    sess.run(tf.initialize_all_variables())
    epochs = 14
    for i in range(epochs):
        b_val = 1.0 #fake data (in SGD it would be different on every epoch)
        print '----'
        x_before_update = x.eval()
        print 'before update',x_before_update
        # get gradients
        #grad_list = [g for (g,v) in gv]
        (summary_str_grad,grad_val) = sess.run([merged] + [dydx], feed_dict={b: b_val})
        grad_vals = sess.run([g for (g,v) in gv], feed_dict={b: b_val})
        print 'grad_vals: ',grad_vals
        writer.add_summary(summary_str_grad, i)
        # applies the gradients
        [summary_str_apply_transform,_] = sess.run([merged,apply_transform_op], feed_dict={b: b_val})
        writer.add_summary(summary_str_apply_transform, i)
        print 'value of x after update should be: ', x_before_update - T(grad_vals[0], decay=decay)
        x_after_update = x.eval()
        print 'after update', x_after_update
Please see the code written below.
x = tf.placeholder("float", [None, 80])
W = tf.Variable(tf.zeros([80,2]))
b = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(x,W) + b)
y_ = tf.placeholder("float", [None,2])
So here we see that there are 80 features in the data with only 2 possible outputs. I set the cross_entropy and the train_step like so.
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(tf.matmul(x, W) + b, y_)
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
Initialize all variables.
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
Then I use this code to "train" my Neural Network.
g = 0
for i in range(len(x_train)):
    _, w_out, b_out = sess.run([train_step, W, b], feed_dict={x: [x_train[g]], y_: [y_train[g]]})
    g += 1
print "...Trained..."
After training the network, it always produces the same accuracy rate regardless of how many times I train it. That accuracy rate is 0.856067 and I get to that accuracy with this code-
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print sess.run(accuracy, feed_dict={x: x_test, y_: y_test})
0.856067
So this is where the question comes in. Is it because I have too small of dimensions? Maybe I should break the features into a 10x8 matrix? Maybe a 4x20 matrix? etc.
Then I try to get the probabilities of the actual test data producing a 0 or a 1 like so-
test_data_actual = genfromtxt('clean-test-actual.csv',delimiter=',') # Actual Test data
x_test_actual = []
for i in test_data_actual:
    x_test_actual.append(i)
x_test_actual = np.array(x_test_actual)
ans = sess.run(y, feed_dict={x: x_test_actual})
And print out the probabilities:
print ans[0:10]
[[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]
[ 1. 0.]]
(Note: it does produce [ 0. 1.] sometimes.)
I then tried to see if applying the expert methodology would produce better results. Please see the following code.
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 1, 1, 1],
                          strides=[1, 1, 1, 1], padding='SAME')
(Please note how I changed the strides in order to avoid errors).
W_conv1 = weight_variable([1, 80, 1, 1])
b_conv1 = bias_variable([1])
Here is where the question comes in again. I define the Tensor (vector/matrix if you will) as 80x1 (so 1 row with 80 features in it); I continue to do that throughout the rest of the code (please see below).
x_ = tf.reshape(x, [-1,1,80,1])
h_conv1 = tf.nn.relu(conv2d(x_, W_conv1) + b_conv1)
Second Convolutional Layer
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([1, 80, 1, 1])
b_conv2 = bias_variable([1])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
Densely Connected Layer
W_fc1 = weight_variable([80, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 80])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
Dropout
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
Readout
W_fc2 = weight_variable([1024, 2])
b_fc2 = bias_variable([2])
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
In the above you'll see that I defined the output as 2 possible answers (also to avoid errors).
Then cross_entropy and the train_step.
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(tf.matmul(h_fc1_drop, W_fc2) + b_fc2, y_)
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
Start the session.
sess.run(tf.initialize_all_variables())
"Train" the neural network.
g = 0
for i in range(len(x_train)):
    if i%100 == 0:
        train_accuracy = accuracy.eval(session=sess, feed_dict={x: [x_train[g]], y_: [y_train[g]], keep_prob: 1.0})
    train_step.run(session=sess, feed_dict={x: [x_train[g]], y_: [y_train[g]], keep_prob: 0.5})
    g += 1
print "test accuracy %g"%accuracy.eval(session=sess, feed_dict={
x: x_test, y_: y_test, keep_prob: 1.0})
test accuracy 0.929267
And, once again, it always produces 0.929267 as the output.
The probabilities on the actual data producing a 0 or a 1 are as follows:
[[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.96712834 0.03287172]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]
[ 0.92820859 0.07179145]]
As you see, there is some variance in these probabilities, but typically just the same result.
I know that this isn't a Deep Learning problem. This is obviously a training problem. I know that there should always be some variance in the training accuracy every time you reinitialize the variables and retrain the network, but I just don't know why or where it's going wrong.
The answer is two-fold.
One problem is with the dimensions/parameters. The other problem is that the features are being placed in the wrong spot.
W_conv1 = weight_variable([1, 2, 1, 80])
b_conv1 = bias_variable([80])
Notice that the first two numbers in the weight_variable are the filter dimensions applied to the input. The last two numbers are the number of input channels and the number of output features. The bias_variable always takes the final number in the weight_variable.
Second Convolutional Layer
W_conv2 = weight_variable([1, 2, 80, 160])
b_conv2 = bias_variable([160])
Here the first two numbers are still the filter dimensions. The last two numbers are the 80 features coming in from the previous layer and the number of features this layer produces; in this case we double it, 80x2=160. The bias_variable then takes the final number in the weight_variable. If you were to finish the code at this point, the last number in the weight_variable would have to be a 1 in order to prevent dimensional errors due to the shape of the input tensor and the output tensor. But, instead, for better predictions, let's add a third convolutional layer.
Third Convolutional Layer
W_conv3 = weight_variable([1, 2, 160, 1])
b_conv3 = bias_variable([1])
Once again, the first two numbers in the weight_variable are the filter dimensions. The third number corresponds to the 160 features we established in the Second Convolutional Layer. The last number in the weight_variable now becomes 1 so we don't run into any dimension errors on the output we are predicting. In this case, the output has dimensions of 1, 2.
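For completeness, here is a sketch of how this third layer could be wired into the graph, reusing the conv2d and max_pool_2x2 helpers from the question and the revised weights above (h_conv3 is my own name; h_pool3 matches the reshape mentioned below):
h_conv3 = tf.nn.relu(conv2d(h_pool2, W_conv3) + b_conv3)   # 160 channels in, 1 channel out
h_pool3 = max_pool_2x2(h_conv3)                            # shape stays [-1, 1, 80, 1]
h_pool3_flat = tf.reshape(h_pool3, [-1, 80])               # 80 features feeding the dense layer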
W_fc2 = weight_variable([80, 1024])
b_fc2 = bias_variable([1024])
Here the number of neurons is 1024, which is completely arbitrary, but the first number in the weight_variable needs to be something that the dimensions of our feature matrix are divisible by. In this case it can be any such number (2, 4, 10, 20, 40, 80). Once again, the bias_variable takes the last number in the weight_variable.
At this point, make sure that the last number in h_pool3_flat = tf.reshape(h_pool3, [-1, 80]) corresponds to the first number in the W_fc2 weight_variable.
Now when you run your training program you will notice that the outcome varies and won't always guess all 1's or all 0's.
When you want to predict the probabilities, you have to feed x to the softmax variable-> y_conv=tf.nn.softmax(tf.matmul(h_fc2_drop, W_fc3) + b_fc3) like so-
ans = sess.run(y_conv, feed_dict={x: x_test_actual, keep_prob: 1.0})
You can alter the keep_prob variable, but keeping it at 1.0 always produces the best results. Now, if you print out ans you'll have something that looks like this-
[[ 0.90855026 0.09144982]
[ 0.93020624 0.06979381]
[ 0.98385173 0.0161483 ]
[ 0.93948185 0.06051811]
[ 0.90705943 0.09294061]
[ 0.95702559 0.04297439]
[ 0.95543593 0.04456403]
[ 0.95944828 0.0405517 ]
[ 0.99154049 0.00845954]
[ 0.84375167 0.1562483 ]
[ 0.98449463 0.01550537]
[ 0.97772813 0.02227189]
[ 0.98341942 0.01658053]
[ 0.93026513 0.06973486]
[ 0.93376994 0.06623009]
[ 0.98026556 0.01973441]
[ 0.93210858 0.06789146]
Notice how the probabilities vary. Your training is now working properly.