I tried to build a simple MLP with an input layer (2 neurons), a hidden layer (5 neurons) and an output layer (1 neuron). I planned to train and feed it with [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] for getting the desired output of [0., 1., 1., 0.] (elementwise).
Unfortunately my code refuses to run. I keep getting dimensionality errors no matter what I try. Quite frustrating :/ I think I'm missing something but I cannot figure out what is wrong.
For better readability I also uploaded the code to a pastebin: code
Any ideas?
import tensorflow as tf
#####################
# preparation stuff #
#####################
# define input and output data
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] # XOR input
output_data = [0., 1., 1., 0.] # XOR output
# create a placeholder for the input
# None indicates a variable batch size for the input
# one input's dimension is [1, 2]
n_input = tf.placeholder(tf.float32, shape=[None, 2])
# number of neurons in the hidden layer
hidden_nodes = 5
################
# hidden layer #
################
b_hidden = tf.Variable(0.1) # hidden layer's bias neuron
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0)) # hidden layer's weight matrix
# initialized with a uniform distribution
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden) # calc hidden layer's activation
################
# output layer #
################
W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0)) # output layer's weight matrix
output = tf.sigmoid(tf.matmul(W_output, hidden)) # calc output layer's activation
############
# learning #
############
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(output, n_input) # calc cross entropy between current
# output and desired output
loss = tf.reduce_mean(cross_entropy) # mean the cross_entropy
optimizer = tf.train.GradientDescentOptimizer(0.1) # take a gradient descent for optimizing with a "stepsize" of 0.1
train = optimizer.minimize(loss) # let the optimizer train
####################
# initialize graph #
####################
init = tf.initialize_all_variables()
sess = tf.Session() # create the session and therefore the graph
sess.run(init) # initialize all variables
# train the network
for epoch in xrange(0, 201):
    sess.run(train) # run the training operation
    if epoch % 20 == 0:
        print("step: {:>3} | W: {} | b: {}".format(epoch, sess.run(W_hidden), sess.run(b_hidden)))
EDIT: I am still getting errors :/
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)
outputs line 27 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible. Altering the line to:
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
seems to be working, but then the error appears in:
output = tf.sigmoid(tf.matmul(hidden, W_output))
telling me: line 34 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible
Turning the statement to:
output = tf.sigmoid(tf.matmul(W_output, hidden))
also throws an exception: line 34 (...) ValueError: Dimensions Dimension(1) and Dimension(5) are not compatible.
EDIT2: I do not really understand this. Shouldn't hidden be W_hidden x n_input.T, since in dimensions this would be (5, 2) x (2, 1)? If I transpose n_input, hidden still works (I don't even see why it works without a transpose at all). However, output keeps throwing errors, although in dimensions this operation should be (1, 5) x (5, 1)?!
(0) It's helpful to include the error output - it's also a useful thing to look at, because it does identify exactly where you were having shape problems.
(1) The shape errors arose because you have the arguments to matmul backwards in both of your matmuls, and the shape of the W_hidden tf.Variable backwards as well. The general rule is that the weights for a layer with input_size inputs and output_size outputs should have shape [input_size, output_size], and the matmul should be tf.matmul(input_to_layer, weights_for_layer) (and then add the biases, which are of shape [output_size]).
So with your code,
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0))
should be:
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0))
and
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
should be tf.matmul(n_input, W_hidden); and
output = tf.sigmoid(tf.matmul(W_output, hidden))
should be tf.matmul(hidden, W_output)
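To make the shape rule concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The helper matmul_shape is hypothetical and only mimics static shape checking for 2-D products, under the assumption that an unknown batch dimension (None) is treated as compatible with anything. That assumption would also explain why tf.matmul(W_hidden, n_input) appeared to work in the question while the next layer then failed.

```python
def matmul_shape(a, b):
    """Return the shape of a 2-D matrix product a x b, or raise ValueError.

    Shapes are (rows, cols) tuples. None stands for an unknown size and is
    assumed to be compatible with any other size (hypothetical helper).
    """
    (a_rows, a_cols), (b_rows, b_cols) = a, b
    if a_cols is not None and b_rows is not None and a_cols != b_rows:
        raise ValueError(
            "Dimensions {} and {} are not compatible".format(a_cols, b_rows))
    return (a_rows, b_cols)


batch, n_in, n_hidden, n_out = None, 2, 5, 1

# Rule from the answer: weights are [input_size, output_size], input first.
hidden_shape = matmul_shape((batch, n_in), (n_in, n_hidden))
output_shape = matmul_shape(hidden_shape, (n_hidden, n_out))
print(hidden_shape, output_shape)  # (None, 5) (None, 1)

# Original ordering: [5, 2] x [None, 2] passes only because None matches 2,
# and it yields a [5, 2] "hidden" tensor that no longer fits W_output.
bad_hidden = matmul_shape((n_hidden, n_in), (batch, n_in))
try:
    matmul_shape((n_hidden, n_out), bad_hidden)
    error_message = ""
except ValueError as err:
    error_message = str(err)
print(bad_hidden, "->", error_message)
```

With the weights stored as [input_size, output_size], every layer maps [batch, input_size] to [batch, output_size], so layers chain without any transposes.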
(2) Once you've fixed those bugs, your run needs to be fed a feed_dict:
sess.run(train)
should be:
sess.run(train, feed_dict={n_input: input_data})
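At least, I presume that this is what you're trying to achieve. Note also that output_data is never used: the cross_entropy line compares output against n_input, so the targets would need their own placeholder, fed through the same feed_dict.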
I am training a keras model whose last layer is a single sigmoid unit:
output = Dense(units=1, activation='sigmoid')
I am training this model with some training data in which the expected output is always a number between 0.0 and 1.0.
I am compiling the model with mean-squared-error:
model.compile(optimizer='adam', loss='mse')
Since both the expected output and the real output are single floats between 0 and 1, I was expecting a loss between 0 and 1 as well, but when I start the training I get a loss of 3.3932, larger than 1.
Am I missing something?
Edit:
I am adding an example to show the problem:
https://drive.google.com/file/d/1fBBrgW-HlBYhG-BUARjTXn3SpWqrHHPK/view?usp=sharing
(I cannot just paste the code because I need to attach the training data)
After running python stackoverflow.py, the summary of the model will be shown, as well as the training process.
I also print the minimum and maximum values of y_true each step to verify that they are within the [0, 1] range.
There is no need to wait for the training to finish, you will see that the loss during the first few epochs is much larger than 1.
First, we can demystify mse loss - it's a normal callable function in tf.keras:
import tensorflow as tf
import numpy as np
mse = tf.keras.losses.mse
print(mse([1] * 3, [0] * 3)) # tf.Tensor(1, shape=(), dtype=int32)
Next, as the name "mean squared error" implies, it's a mean, so the size of the vectors passed to it does not change the value as long as the mean is the same:
print(mse([1] * 10, [0] * 10)) # tf.Tensor(1, shape=(), dtype=int32)
In order for the mse to exceed 1, the average squared error must exceed 1, which requires at least some individual errors larger than 1 in magnitude:
print( mse(np.random.random((100,)), np.random.random((100,))) ) # tf.Tensor(0.14863832582680103, shape=(), dtype=float64)
print( mse( 10 * np.random.random((100,)), np.random.random((100,))) ) # tf.Tensor(30.51209646429651, shape=(), dtype=float64)
Lastly, sigmoid indeed guarantees that output is between 0 and 1:
sigmoid = tf.keras.activations.sigmoid
signal = 10 * np.random.random((100,))
output = sigmoid(signal)
print(f"Raw: {np.mean(signal):.2f}; Sigmoid: {np.mean(output):.2f}" ) # Raw: 5.35; Sigmoid: 0.92
What this implies is that in your code, y_true is NOT confined to the range [0, 1].
You can verify this with np.min(y_true) and np.max(y_true).
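The bound behind this argument can be checked without any framework. The following is a small sketch in plain Python; the mse helper and the random data are illustrative assumptions, not the asker's model or dataset.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences (illustrative helper)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


random.seed(0)
targets = [random.random() for _ in range(1000)]  # values in [0, 1)
preds = [random.random() for _ in range(1000)]    # values in [0, 1)

# Every individual error lies in (-1, 1), so the mean of the squares is below 1.
in_range = mse(targets, preds)

# The largest possible value for inputs in [0, 1]: every error is exactly 1.
worst_case = mse([1.0] * 4, [0.0] * 4)

# Once the targets leave [0, 1], the loss is no longer bounded by 1.
out_of_range = mse([10 * t for t in targets], preds)

print(in_range, worst_case, out_of_range)
```

So a reported MSE above 1 means that at least one of the two arguments contains values outside [0, 1], or that the number being printed is not the plain MSE of those two arguments.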
I do not have an answer for the question asked. I am getting nans in my MSE loss, with input in range [0,1] and sigmoid at output. So I thought the question is relevant.
Here are a few observations about sigmoid:
import tensorflow as tf
import numpy as np
x=tf.constant([-20, -1.0, 0.0, 1.0, 20], dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([2.0611537e-09, 2.6894143e-01, 5.0000000e-01, 7.3105860e-01,
# 1.0000000e+00], dtype=float32)
x=tf.constant([float('nan')]*5, dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([nan, nan, nan, nan, nan], dtype=float32)
x=tf.constant([np.inf]*5, dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([1., 1., 1., 1., 1.], dtype=float32)
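So, it is possible to get nans out of sigmoid, but in these checks only when the input is already nan; large and infinite inputs saturate instead. Just in case someone (me, in the near future) has this doubt (again).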
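The same observations can be reproduced with a hand-written sigmoid in plain Python. This is a sketch for illustration only (it is not how tf.keras.activations.sigmoid is implemented); the two-branch form is a common way to avoid overflow in exp.

```python
import math


def sigmoid(x):
    """Logistic function written so that exp never receives a large positive value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x (and for nan, where the comparison above is False).
    z = math.exp(x)
    return z / (1.0 + z)


for value in (-20.0, -1.0, 0.0, 1.0, 20.0, math.inf, -math.inf, math.nan):
    print(value, sigmoid(value))
```

Finite and infinite inputs all land in [0, 1]; only a nan input produces a nan output, so a nan loss points at nans entering the network earlier (inputs, weights, or gradients) rather than at the sigmoid itself.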
I'm still working on my understanding of the PyTorch autograd system. One thing I'm struggling with is understanding why .clamp(min=0) and nn.functional.relu() seem to have different backward passes.
It's especially confusing as .clamp is used equivalently to relu in PyTorch tutorials, such as https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn.
I found this when analysing the gradients of a simple fully connected net with one hidden layer and a relu activation (linear in the output layer).
To my understanding, the output of the following code should be just zeros. I hope someone can show me what I am missing.
import torch
dtype = torch.float
x = torch.tensor([[3,2,1],
[1,0,2],
[4,1,2],
[0,0,1]], dtype=dtype)
y = torch.ones(4,4)
w1_a = torch.tensor([[1,2],
[0,1],
[4,0]], dtype=dtype, requires_grad=True)
w1_b = w1_a.clone().detach()
w1_b.requires_grad = True
w2_a = torch.tensor([[-1, 1],
[-2, 3]], dtype=dtype, requires_grad=True)
w2_b = w2_a.clone().detach()
w2_b.requires_grad = True
y_hat_a = torch.nn.functional.relu(x.mm(w1_a)).mm(w2_a)
y_a = torch.ones_like(y_hat_a)
y_hat_b = x.mm(w1_b).clamp(min=0).mm(w2_b)
y_b = torch.ones_like(y_hat_b)
loss_a = (y_hat_a - y_a).pow(2).sum()
loss_b = (y_hat_b - y_b).pow(2).sum()
loss_a.backward()
loss_b.backward()
print(w1_a.grad - w1_b.grad)
print(w2_a.grad - w2_b.grad)
# OUT:
# tensor([[ 0., 0.],
# [ 0., 0.],
# [ 0., -38.]])
# tensor([[0., 0.],
# [0., 0.]])
#
The reason is that clamp and relu produce different gradients at 0. You can check this with a scalar tensor x = 0 by comparing the two versions: (x.clamp(min=0) - 1.0).pow(2).backward() versus (relu(x) - 1.0).pow(2).backward(). The resulting x.grad is 0 for the relu version but -2 for the clamp version. That means relu chooses x == 0 --> grad = 0 while clamp chooses x == 0 --> grad = 1. In your example the hidden pre-activation x.mm(w1) contains an exact 0 (last row, second column), which is where the two gradients diverge.
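The scalar check can be worked through by hand with the chain rule. The sketch below is plain Python and does not call PyTorch; the two slope functions simply encode the conventions reported above (slope 0 at x == 0 for relu, slope 1 at x == 0 for clamp) and are assumptions taken from that observation.

```python
def relu_slope(x):
    # Convention reported for relu: the slope at exactly 0 is 0.
    return 1.0 if x > 0 else 0.0


def clamp_min0_slope(x):
    # Convention reported for clamp(min=0): the slope at exactly 0 is 1.
    return 1.0 if x >= 0 else 0.0


def loss_grad(x, slope):
    """d/dx of (max(x, 0) - 1)^2 via the chain rule: 2 * (max(x, 0) - 1) * slope(x)."""
    return 2.0 * (max(x, 0.0) - 1.0) * slope(x)


for x in (-1.0, 0.0, 1.0):
    print(x, loss_grad(x, relu_slope), loss_grad(x, clamp_min0_slope))
```

The two versions agree everywhere except at x == 0, where the relu convention gives a zero gradient and the clamp convention gives -2, matching the x.grad values described above.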
Simply exchanging the nn.softmax function for a combination which uses tf.exp, keeping everything else like it was, causes not only the gradients to contain NaN but also the intermediate variable s. I have no idea why this is.
tempX = x
tempW = W
tempMult = tf.matmul(tempX, W)
s = tempMult + b
#! ----------------------------
#p = tf.nn.softmax(s)
p = tf.exp(s) / tf.reduce_sum(tf.exp(s), axis=1)
#!------------------------------
myTemp = y*tf.log(p)
cost = tf.reduce_mean(-tf.reduce_sum(myTemp, reduction_indices=1)) + mylambda*tf.reduce_sum(tf.multiply(W,W))
grad_W, grad_b = tf.gradients(xs=[W, b], ys=cost)
new_W = W.assign(W - tf.multiply(learning_rate, grad_W))
new_b = b.assign(b - tf.multiply(learning_rate, grad_b))
Answer
tf.exp(s) easily overflows for large s. That's the main reason that tf.nn.softmax doesn't actually use that equation but does something equivalent to it (according to the docs).
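A common way to get an equivalent but overflow-safe softmax is to subtract the maximum score before exponentiating. The sketch below shows this in plain Python; it illustrates the general technique and is not a claim about how tf.nn.softmax is implemented internally. Note that Python's math.exp raises OverflowError where float32 tensor code would typically produce inf and then nan.

```python
import math


def naive_softmax(scores):
    """Softmax exactly as written in the question: exp(s) / sum(exp(s))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(scores):
    """Same result mathematically, but the largest exponent is exp(0) = 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

# Both versions agree while the scores are small.
agree_on_small = all(
    abs(a - b) < 1e-12
    for a, b in zip(naive_softmax(small), stable_softmax(small)))

# The naive version breaks on large scores; the shifted one does not.
try:
    naive_softmax(large)
    naive_failed = False
except OverflowError:
    naive_failed = True

stable_large = stable_softmax(large)
print(agree_on_small, naive_failed, stable_large)
```

Subtracting a constant from every score leaves the softmax unchanged because the factor exp(-shift) cancels between numerator and denominator.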
Discussion
When I rewrote your softmax function to
p = tf.exp(s) / tf.reshape( tf.reduce_sum(tf.exp(s), axis=1), [-1,1] )
It worked without a problem.
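One plausible reading of why the reshape helps: tf.reduce_sum(..., axis=1) returns one sum per row, with shape [batch], and dividing a [batch, classes] tensor by a [batch] tensor aligns the sums with the last axis instead of with the rows. The sketch below checks this with a hypothetical broadcast_shape helper that follows the usual NumPy-style rule (shapes are aligned from the trailing axis); the shapes are example values, not taken from the question.

```python
def broadcast_shape(a, b):
    """Result shape of broadcasting a with b, aligned from the trailing axis."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError("cannot broadcast {} with {}".format(a, b))
        result.append(max(da, db))
    return tuple(reversed(result))


batch, classes = 4, 3

# Row sums reshaped to [batch, 1]: each row is divided by its own sum.
with_reshape = broadcast_shape((batch, classes), (batch, 1))

# Row sums left as [batch]: the trailing axes (3 vs 4) do not match.
try:
    broadcast_shape((batch, classes), (batch,))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# If batch == classes the division is accepted, but the sums then line up
# with columns instead of rows, which silently gives wrong probabilities.
silent_case = broadcast_shape((3, 3), (3,))
print(with_reshape, mismatch_detected, silent_case)
```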
Here is a fully working Python 2.7 implementation that uses a hand-crafted softmax (with the reshape fix):
# -- imports --
import tensorflow as tf
import numpy as np
# np.set_printoptions(precision=1) reduces np precision output to 1 digit
np.set_printoptions(precision=2, suppress=True)
# -- constant data --
x = [[0., 0.], [1., 1.], [1., 0.], [0., 1.]]
y_ = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]
# -- induction --
# 1x2 input -> 2x3 hidden sigmoid -> 3x2 softmax output
# Layer 0 = the x2 inputs
x0 = tf.constant(x, dtype=tf.float32)
y0 = tf.constant(y_, dtype=tf.float32)
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable(tf.random_uniform([2, 3], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([3], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(x0, m1) + b1)
# Layer 2 = the 3x2 softmax output
m2 = tf.Variable(tf.random_uniform([3, 2], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=0.1, maxval=0.9, dtype=tf.float32))
h2 = tf.matmul(h1, m2) + b2
y_out = tf.exp(h2) / tf.reshape( tf.reduce_sum(tf.exp(h2), axis=1) , [-1,1] )
# -- loss --
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# -- training --
# run 500 times using all the X and Y
# print out the loss and any other interesting info
#with tf.Session() as sess:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print "\nloss"
for step in range(500):
    sess.run(train)
    if (step + 1) % 100 == 0:
        print sess.run(loss)
results = sess.run([m1, b1, m2, b2, y_out, loss])
labels = "m1,b1,m2,b2,y_out,loss".split(",")
for label, result in zip(*(labels, results)):
    print ""
    print label
    print result
print ""
Perhaps your initial values for W and b are too large. I tried re-running my code above with the weights initialized to large numbers and I was able to reproduce your NaN issue.
I would like to know if there is an easy way to constrain variables in a matrix in TensorFlow.
As a toy example, I wrote a piece of code where I would like my input_matrix to converge towards [[2., 1.], [1., 2.]].
import tensorflow as tf
sess = tf.Session()
v1 = tf.Variable(1.)
v2 = tf.Variable(2.)
#Here I specify that some of the variables in the matrix must have the same values, but it obviously doesn't work since TensorFlow variables need to be initialized before being used
input_matrix = tf.Variable([[v1, v2], [v2, v1]])
objective_matrix = tf.constant([[0., 1.], [1., 4.]])
optimizer = tf.train.GradientDescentOptimizer(1e-1)
cost = tf.reduce_sum(tf.square(tf.subtract(objective_matrix, input_matrix)))
train_step = optimizer.minimize(cost)
sess.run(tf.global_variables_initializer())
for _ in range(100):
    sess.run(train_step)
So, is it possible to force some elements of a matrix to be equal, or at least to share the same gradients?
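One way to think about the constraint is to optimize only the two shared scalars and rebuild the matrix from them at every step, so that tied entries automatically receive the sum of their gradients. Below is a minimal sketch of that idea in plain Python with hand-written gradients; the helper tied_matrix is hypothetical. In TensorFlow this would presumably correspond to building input_matrix from v1 and v2 with something like tf.stack instead of wrapping it in a new tf.Variable, which is an assumption and not tested here.

```python
objective = [[0.0, 1.0], [1.0, 4.0]]
v1, v2 = 1.0, 2.0          # the only trainable parameters
learning_rate = 0.1


def tied_matrix(a, b):
    """Matrix whose diagonal shares a and whose off-diagonal shares b."""
    return [[a, b], [b, a]]


for _ in range(100):
    m = tied_matrix(v1, v2)
    # Gradient of cost = sum((objective - m)^2) with respect to each entry.
    g = [[2.0 * (m[i][j] - objective[i][j]) for j in range(2)] for i in range(2)]
    # A shared parameter collects the gradients of every entry it appears in.
    grad_v1 = g[0][0] + g[1][1]
    grad_v2 = g[0][1] + g[1][0]
    v1 -= learning_rate * grad_v1
    v2 -= learning_rate * grad_v2

print(tied_matrix(v1, v2))
```

With this objective the shared diagonal settles at the average of 0 and 4 and the off-diagonal at 1, which is the [[2., 1.], [1., 2.]] target mentioned in the question.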
Given the DNN (simple case of multilayered perceptron) with 2 hidden layers of 5 and 3 dimensions respectively, I am training a model to recognize the OR gate.
Using tensorflow learn, it seems like it's giving me the reverse output and I have no idea why:
import numpy as np
from tensorflow.contrib import learn
classifier = learn.DNNClassifier(hidden_units=[5, 3], n_classes=2)
or_input = np.array([[0.,0.], [0.,1.], [1.,0.]])
or_output = np.array([[0,1,1]]).T
classifier.fit(or_input, or_output, steps=0.05, batch_size=3)
classifier.predict(np.array([ [1., 1.], [1., 0.] , [0., 0.] , [0., 1.]]))
[out]:
array([0, 0, 1, 0])
If I'm doing it "old-school", without the tensorflow.learn as follows, I get the expected answer.
import numpy as np
import tensorflow as tf
# Parameters
learning_rate = 1.0
num_epochs = 1000
# Network Parameters
input_dim = 2 # Input dimensions.
hidden_dim_1 = 5 # 1st layer number of features
hidden_dim_2 = 3 # 2nd layer number of features
output_dim = 1 # Output dimensions.
# tf Graph input
x = tf.placeholder("float", [None, input_dim])
y = tf.placeholder("float", [hidden_dim_2, output_dim])
# With biases.
weights = {
    'syn0': tf.Variable(tf.random_normal([input_dim, hidden_dim_1])),
    'syn1': tf.Variable(tf.random_normal([hidden_dim_1, hidden_dim_2])),
    'syn2': tf.Variable(tf.random_normal([hidden_dim_2, output_dim]))
}
biases = {
    'b0': tf.Variable(tf.random_normal([hidden_dim_1])),
    'b1': tf.Variable(tf.random_normal([hidden_dim_2])),
    'b2': tf.Variable(tf.random_normal([output_dim]))
}
# Create a model
def multilayer_perceptron(X, weights, biases):
    # Hidden layer 1 + sigmoid activation function
    layer_1 = tf.add(tf.matmul(X, weights['syn0']), biases['b0'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Hidden layer 2 + sigmoid activation function
    layer_2 = tf.add(tf.matmul(layer_1, weights['syn1']), biases['b1'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Output layer
    out_layer = tf.matmul(layer_2, weights['syn2']) + biases['b2']
    out_layer = tf.nn.sigmoid(out_layer)
    return out_layer
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Define loss and optimizer
cost = tf.sub(y, pred)
# Or you can use fancy cost like:
##tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.initialize_all_variables()
or_input = np.array([[0.,0.], [0.,1.], [1.,0.]])
or_output = np.array([[0.,1.,1.]]).T
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(num_epochs):
        batch_x, batch_y = or_input, or_output # Loop over all data points.
        # Run optimization op (backprop) and cost op (to get loss value)
        _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
        #print (c)
    # Now let's test it on the unknown dataset.
    new_inputs = np.array([[1.,1.], [1.,0.]])
    feed_dict = {x: new_inputs}
    predictions = sess.run(pred, feed_dict)
    print (predictions)
[out]:
[[ 0.99998868]
[ 0.99998868]]
Why am I getting the reversed output using tensorflow.learn? Am I doing something wrong with tensorflow.learn?
How do I get the tensorflow.learn code to produce the same output as the "old-school" tensorflow framework?
If you specify a sensible value for steps, you get the expected results:
classifier.fit(or_input, or_output, steps=1000, batch_size=3)
Result:
array([1, 1, 0, 1])
How does steps work
The steps argument specifies the number of times you run the training operation. Let me give you some examples:
with batch_size = 16 and steps = 10, you will see a total of 160 examples
in your example, with batch_size = 3 and steps = 1000, the algorithm will see 3000 examples. In fact, it will see the same 3 examples you provided 1000 times
So, steps is not the number of epochs, it is the number of times you run the training op, or the number of times you see a new batch.
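The arithmetic above can be written down as a tiny sketch. The helper names examples_seen and passes_over_data are made up for illustration and are not part of tf.learn.

```python
def examples_seen(batch_size, steps):
    """Total number of examples processed: one batch per training step."""
    return batch_size * steps


def passes_over_data(batch_size, steps, dataset_size):
    """How many times the whole dataset is seen, i.e. the equivalent epochs."""
    return examples_seen(batch_size, steps) / dataset_size


print(examples_seen(16, 10))          # 160
print(examples_seen(3, 1000))         # 3000
print(passes_over_data(3, 1000, 3))   # 1000.0
```

In the OR example the batch is the whole dataset, so steps and epochs happen to coincide; with a larger dataset they would not.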
Why is steps = 0.05 allowed?
In the tf.learn code, they don't check if steps is an integer. They just run a while loop checking that (at this line):
last_step < max_steps
So if max_steps = 0.05, it will behave the same as if max_steps = 1 (last_step is incremented in the loop).
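The effect of a fractional max_steps is easy to reproduce with a simplified stand-in for that loop. This is only a model of the comparison quoted above, not the actual tf.learn training code; run_training is a hypothetical name.

```python
def run_training(max_steps):
    """Count how many training steps a 'while last_step < max_steps' loop executes."""
    last_step = 0
    executed = 0
    while last_step < max_steps:
        executed += 1      # stands in for one run of the training op
        last_step += 1
    return executed


print(run_training(0.05))   # 1
print(run_training(1))      # 1
print(run_training(1000))   # 1000
```

Since 0 < 0.05 is true but 1 < 0.05 is not, the loop body runs exactly once, so the classifier in the question was trained for a single step before predicting.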