Simply exchanging the tf.nn.softmax function for a hand-written combination that uses tf.exp, keeping everything else as it was, causes not only the gradients but also the intermediate variable s to contain NaN. I have no idea why this happens.
tempX = x
tempW = W
tempMult = tf.matmul(tempX, W)
s = tempMult + b
#! ----------------------------
#p = tf.nn.softmax(s)
p = tf.exp(s) / tf.reduce_sum(tf.exp(s), axis=1)
#!------------------------------
myTemp = y*tf.log(p)
cost = tf.reduce_mean(-tf.reduce_sum(myTemp, reduction_indices=1)) + mylambda*tf.reduce_sum(tf.multiply(W,W))
grad_W, grad_b = tf.gradients(xs=[W, b], ys=cost)
new_W = W.assign(W - tf.multiply(learning_rate, grad_W))
new_b = b.assign(b - tf.multiply(learning_rate, grad_b))
Answer
tf.exp(s) easily overflows for large s. That's the main reason tf.nn.softmax doesn't literally use that equation but instead does something equivalent to it (according to the docs).
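For reference, here is a minimal sketch of the standard max-subtraction trick (my assumption about what that equivalent computation looks like, not tf.nn.softmax's actual source). Subtracting the row-wise maximum leaves the result mathematically unchanged but keeps tf.exp from overflowing:
def stable_softmax(s):
    # largest exponent becomes exp(0) = 1, so no overflow
    e = tf.exp(s - tf.reshape(tf.reduce_max(s, axis=1), [-1, 1]))
    return e / tf.reshape(tf.reduce_sum(e, axis=1), [-1, 1])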
Discussion
When I rewrote your softmax function to
p = tf.exp(s) / tf.reshape( tf.reduce_sum(tf.exp(s), axis=1), [-1,1] )
it worked without a problem. The row sums returned by tf.reduce_sum have shape [batch], so dividing the [batch, classes] matrix by them lines up the wrong axes; reshaping to [batch, 1] makes the division normalize each row.
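To see why the reshape matters, here is a small numpy illustration (my own example, not from the original code):
import numpy as np
e = np.exp(np.array([[1., 2., 3.], [1., 1., 1.]]))
row_sums = e.sum(axis=1)         # shape (2,)
# e / row_sums                   # ValueError: shapes (2, 3) and (2,) don't broadcast
p = e / row_sums.reshape(-1, 1)  # shape (2, 3); each row now sums to 1
print(p.sum(axis=1))             # [1. 1.]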
Here is a fully working Python 2.7 implementation that uses a hand-crafted softmax (with the reshape fix):
# -- imports --
import tensorflow as tf
import numpy as np
# reduce np print precision to 2 digits and suppress scientific notation
np.set_printoptions(precision=2, suppress=True)
# -- constant data --
x = [[0., 0.], [1., 1.], [1., 0.], [0., 1.]]
y_ = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]
# -- induction --
# 1x2 input -> 2x3 hidden sigmoid -> 3x2 softmax output
# Layer 0 = the 2 inputs
x0 = tf.constant(x, dtype=tf.float32)
y0 = tf.constant(y_, dtype=tf.float32)
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable(tf.random_uniform([2, 3], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([3], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(x0, m1) + b1)
# Layer 2 = the 3x2 softmax output
m2 = tf.Variable(tf.random_uniform([3, 2], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=0.1, maxval=0.9, dtype=tf.float32))
h2 = tf.matmul(h1, m2) + b2
y_out = tf.exp(h2) / tf.reshape( tf.reduce_sum(tf.exp(h2), axis=1) , [-1,1] )
# -- loss --
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))
# training step : gradient descent (learning rate 1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# -- training --
# run 500 times using all the X and Y
# print out the loss and any other interesting info
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print "\nloss"
for step in range(500):
    sess.run(train)
    if (step + 1) % 100 == 0:
        print sess.run(loss)
results = sess.run([m1, b1, m2, b2, y_out, loss])
labels = "m1,b1,m2,b2,y_out,loss".split(",")
for label, result in zip(*(labels, results)):
    print ""
    print label
    print result
print ""
Perhaps your initial values for M and b are too large. I tried re-running my above code with weights initialized to large numbers and I was able to reproduce your NaN issue.
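That overflow is easy to demonstrate in numpy (hypothetical values, just to show the mechanism):
import numpy as np
s = np.array([[1000., 1000.]], dtype=np.float32)
print(np.exp(s))                    # [[inf inf]] -- exp overflows float32
print(np.exp(s) / np.exp(s).sum())  # [[nan nan]] -- inf / inf is NaN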
Related
I am trying to calculate cross entropy loss using pytorch's BCELoss Function for a binary classification problem. While tinkering I found this weird behaviour.
import torch
from torch import nn
sigmoid = nn.Sigmoid()
loss = nn.BCELoss(reduction="sum")
target = torch.tensor([0., 1.])
input1 = torch.tensor([1., 1.], requires_grad=True)
input2 = sigmoid(torch.tensor([10., 10.], requires_grad=True))
print(input2) #tensor([1.0000, 1.0000], grad_fn=<SigmoidBackward>)
print(loss(input1, target)) #tensor(100., grad_fn=<BinaryCrossEntropyBackward>)
print(loss(input2, target)) #tensor(9.9996, grad_fn=<BinaryCrossEntropyBackward>)
Since both input1 and input2 have the same value, shouldn't they return the same loss value instead of 100 and 9.9996? The correct loss value should be 100, since I am multiplying by log(0) ~ -infinity, which is capped at -100 in PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html
What is going on here and where am I going wrong?
sigmoid(10) is not exactly equal to 1:
>>> 1 / (1 + torch.exp(-torch.tensor(10.))).item()
0.9999545833234493
In your case:
>>> sigmoid(torch.tensor([10., 10.], requires_grad=True)).tolist()
[0.9999545812606812, 0.9999545812606812]
Thus input1 is not the same as input2: [1.0, 1.0] vs [0.9999545812606812, 0.9999545812606812].
Let's compute BCE manually:
def bce(x, y):
    return -(y * torch.log(x) + (1 - y) * torch.log(1 - x)).item()
# input1
x1 = torch.tensor(1.)
x2 = torch.tensor(1.)
y1 = torch.tensor(0.)
y2 = torch.tensor(1.)
print("input1:", sum([bce(x1, y1), bce(x2, y2)]))
# input2
x1 = torch.tensor(0.9999545812606812)
x2 = torch.tensor(0.9999545812606812)
y1 = torch.tensor(0.)
y2 = torch.tensor(1.)
print("input2:", sum([bce(x1, y1), bce(x2, y2)]))
input1: nan
input2: 9.999631525119185
For input1 we get nan, but according to docs:
Our solution is that BCELoss clamps its log function outputs to be greater than or equal to -100. This way, we can always have a finite loss value and a linear backward method.
That's why we get 100 in the final PyTorch BCE output.
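You can mirror that clamping by hand (a sketch of the documented behavior, not BCELoss's actual implementation):
def bce_clamped(x, y):
    # cap each log term at -100, as the docs describe
    log_x = torch.clamp(torch.log(x), min=-100)
    log_1mx = torch.clamp(torch.log(1 - x), min=-100)
    return -(y * log_x + (1 - y) * log_1mx).item()

print(bce_clamped(torch.tensor(1.), torch.tensor(0.)))  # 100.0 -- the capped log(0) term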
I'm still working on my understanding of the PyTorch autograd system. One thing I'm struggling with is understanding why .clamp(min=0) and nn.functional.relu() seem to have different backward passes.
It's especially confusing as .clamp is used equivalently to relu in PyTorch tutorials, such as https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn.
I found this when analysing the gradients of a simple fully connected net with one hidden layer and a relu activation (linear in the outputlayer).
To my understanding, the output of the following code should be all zeros. I hope someone can show me what I am missing.
import torch
dtype = torch.float
x = torch.tensor([[3, 2, 1],
                  [1, 0, 2],
                  [4, 1, 2],
                  [0, 0, 1]], dtype=dtype)
y = torch.ones(4, 4)
w1_a = torch.tensor([[1, 2],
                     [0, 1],
                     [4, 0]], dtype=dtype, requires_grad=True)
w1_b = w1_a.clone().detach()
w1_b.requires_grad = True
w2_a = torch.tensor([[-1, 1],
                     [-2, 3]], dtype=dtype, requires_grad=True)
w2_b = w2_a.clone().detach()
w2_b.requires_grad = True
y_hat_a = torch.nn.functional.relu(x.mm(w1_a)).mm(w2_a)
y_a = torch.ones_like(y_hat_a)
y_hat_b = x.mm(w1_b).clamp(min=0).mm(w2_b)
y_b = torch.ones_like(y_hat_b)
loss_a = (y_hat_a - y_a).pow(2).sum()
loss_b = (y_hat_b - y_b).pow(2).sum()
loss_a.backward()
loss_b.backward()
print(w1_a.grad - w1_b.grad)
print(w2_a.grad - w2_b.grad)
# OUT:
# tensor([[ 0., 0.],
# [ 0., 0.],
# [ 0., -38.]])
# tensor([[0., 0.],
# [0., 0.]])
#
The reason is that clamp and relu produce different gradients at 0. You can check this with a scalar tensor x = 0 by comparing the two versions: (x.clamp(min=0) - 1.0).pow(2).backward() versus (relu(x) - 1.0).pow(2).backward(). The resulting x.grad is 0 for the relu version but -2 for the clamp version. That means relu chooses x == 0 --> grad = 0, while clamp chooses x == 0 --> grad = 1.
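A runnable version of that check (my own sketch; the -2 follows from the chain rule, 2 * (f(0) - 1) * f'(0) = -2 when f'(0) is taken to be 1):
import torch

x = torch.zeros(1, requires_grad=True)
(torch.nn.functional.relu(x) - 1.0).pow(2).sum().backward()
print(x.grad)  # tensor([0.]) -> relu picks gradient 0 at x == 0

x = torch.zeros(1, requires_grad=True)
(x.clamp(min=0) - 1.0).pow(2).sum().backward()
print(x.grad)  # tensor([-2.]) -> clamp picks gradient 1 at x == 0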
Problem description
I have inputs x that are indicator variables, and outputs y, where each row is a random one-hot vector that depends on the values of x (data sample shown below).
I want to train a model that essentially learns the probabilistic relationship between x and y in the form of per-column weights. The model must "choose" one, and only one, indicator to output. My current approach is to sample a categorical random variable and produce a one-hot vector as a prediction.
The issue is that I'm getting an error ValueError: An operation has `None` for gradient when I try to train my Keras model.
I find this error odd, because I've trained mixture networks using Keras and Tensorflow, which use tf.contrib.distributions.Categorical, and I did not run into any gradient-related issues.
Code
Experiment
import tensorflow as tf
import tensorflow.contrib.distributions as tfd
import numpy as np
from keras import backend as K
from keras.layers import Layer
from keras.models import Sequential
from keras.utils import to_categorical
def make_xy_prob(rng, size=10000):
    rng = np.random.RandomState(rng) if isinstance(rng, int) else rng
    cols = 3
    weights = np.array([[1, 2, 3]])
    # generate data and drop zeros for now
    x = rng.choice(2, (size, cols))
    is_zeros = x.sum(axis=1) == 0
    x = x[~is_zeros]
    # use weights to create probabilities for determining y
    weighted_x = x * weights
    prob_x = weighted_x / weighted_x.sum(axis=1, keepdims=True)
    y = np.row_stack([to_categorical(rng.choice(cols, p=p), cols) for p in prob_x])
    # add zeros back and shuffle
    zeros = np.zeros(((size - len(x), cols)))
    x = np.row_stack([x, zeros])
    y = np.row_stack([y, zeros])
    shuffle_idx = rng.permutation(size)
    x = x[shuffle_idx]
    y = y[shuffle_idx]
    return x, y
class OneHotGate(Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel', shape=(1, input_shape[1]), initializer='ones')

    def call(self, x):
        zero_cond = x < 1
        x_shape = tf.shape(x)
        # weight indicators so that more probability is assigned to more likely columns
        weighted_x = x * self.kernel
        # fill zeros with -inf so that zero probability is assigned to that column
        ninf_fill = tf.fill(x_shape, -np.inf)
        masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
        onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))
        # fill gate with zeros where input was originally zero
        zeros_fill = tf.fill(x_shape, 0.0)
        masked_gate = tf.where(zero_cond, zeros_fill, onehot_gate)
        return masked_gate
def experiment(epochs=10):
    K.clear_session()
    rng = np.random.RandomState(2)
    X, y = make_xy_prob(rng)
    input_shape = (X.shape[1], )
    model = Sequential()
    gate_layer = OneHotGate(input_shape=input_shape)
    model.add(gate_layer)
    model.compile('adam', 'categorical_crossentropy')
    model.fit(X, y, 64, epochs, verbose=1)
Data sample
>>> x
array([[1., 1., 1.],
[0., 1., 0.],
[1., 0., 1.],
...,
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 0.]])
>>> y
array([[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
...,
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.]])
Error
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
The problem lies in the fact that OneHotCategorical performs discontinuous sampling, which causes the gradient computation to fail. To replace this discontinuous sampling with a continuous (relaxed) version, you can try RelaxedOneHotCategorical (which is based on the interesting Gumbel-Softmax technique).
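A sketch of what that swap could look like inside call (the temperature value is my own assumption; lower temperatures give samples closer to one-hot but noisier gradients):
temperature = 0.5  # hypothetical choice
relaxed = tfd.RelaxedOneHotCategorical(temperature, logits=masked_x)
onehot_gate = tf.squeeze(relaxed.sample(1))  # differentiable, only approximately one-hot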
I would like to know if there is an easy way to constrain variables in a matrix in TensorFlow.
As a toy example, I wrote a piece of code where I would like my input_matrix to converge towards [[2., 1.], [1., 2.]].
import tensorflow as tf
sess = tf.Session()
v1 = tf.Variable(1.)
v2 = tf.Variable(2.)
#Here I specify that some of the variables in the matrix must have the same values, but it obviously doesn't work since TensorFlow variables need to be initialized before being used
input_matrix = tf.Variable([[v1, v2], [v2, v1]])
objective_matrix = tf.constant([[0., 1.], [1., 4.]])
optimizer = tf.train.GradientDescentOptimizer(1e-1)
cost = tf.reduce_sum(tf.square(tf.subtract(objective_matrix, input_matrix)))
train_step = optimizer.minimize(cost)
sess.run(tf.global_variables_initializer())
for _ in range(100):
    sess.run(train_step)
So, is it possible to force some elements of a matrix to be equal, or at least their gradients?
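One possible approach (my own sketch, not from the original thread): build the matrix with tf.stack from the shared variables instead of wrapping them in a new tf.Variable, so gradients flow back to v1 and v2 and the tied entries stay equal by construction.
import tensorflow as tf

sess = tf.Session()
v1 = tf.Variable(1.)
v2 = tf.Variable(2.)
# the matrix is an op built from the variables, not a separate variable
input_matrix = tf.stack([tf.stack([v1, v2]), tf.stack([v2, v1])])
objective_matrix = tf.constant([[0., 1.], [1., 4.]])
cost = tf.reduce_sum(tf.square(tf.subtract(objective_matrix, input_matrix)))
train_step = tf.train.GradientDescentOptimizer(1e-1).minimize(cost)
sess.run(tf.global_variables_initializer())
for _ in range(100):
    sess.run(train_step)
print(sess.run(input_matrix))  # approaches [[2., 1.], [1., 2.]]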
I tried to build a simple MLP with an input layer (2 neurons), a hidden layer (5 neurons) and an output layer (1 neuron). I planned to train and feed it with [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] for getting the desired output of [0., 1., 1., 0.] (elementwise).
Unfortunately my code refuses to run. I keep getting dimensionality errors no matter what I try. Quite frustrating :/ I think I'm missing something but I cannot figure out what is wrong.
For better readability I also uploaded the code to a pastebin: code
Any ideas?
import tensorflow as tf
#####################
# preparation stuff #
#####################
# define input and output data
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] # XOR input
output_data = [0., 1., 1., 0.] # XOR output
# create a placeholder for the input
# None indicates a variable batch size for the input
# one input's dimension is [1, 2]
n_input = tf.placeholder(tf.float32, shape=[None, 2])
# number of neurons in the hidden layer
hidden_nodes = 5
################
# hidden layer #
################
b_hidden = tf.Variable(0.1) # hidden layer's bias neuron
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0)) # hidden layer's weight matrix
# initialized with a uniform distribution
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden) # calc hidden layer's activation
################
# output layer #
################
W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0)) # output layer's weight matrix
output = tf.sigmoid(tf.matmul(W_output, hidden)) # calc output layer's activation
############
# learning #
############
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(output, n_input) # calc cross entropy between current
# output and desired output
loss = tf.reduce_mean(cross_entropy) # mean the cross_entropy
optimizer = tf.train.GradientDescentOptimizer(0.1) # take a gradient descent for optimizing with a "stepsize" of 0.1
train = optimizer.minimize(loss) # let the optimizer train
####################
# initialize graph #
####################
init = tf.initialize_all_variables()
sess = tf.Session() # create the session and therefore the graph
sess.run(init) # initialize all variables
# train the network
for epoch in xrange(0, 201):
    sess.run(train)  # run the training operation
    if epoch % 20 == 0:
        print("step: {:>3} | W: {} | b: {}".format(epoch, sess.run(W_hidden), sess.run(b_hidden)))
EDIT: I am still getting errors :/
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)
outputs line 27 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible. Altering the line to:
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
seems to be working, but then the error appears in:
output = tf.sigmoid(tf.matmul(hidden, W_output))
telling me: line 34 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible
Turning the statement to:
output = tf.sigmoid(tf.matmul(W_output, hidden))
also throws an exception: line 34 (...) ValueError: Dimensions Dimension(1) and Dimension(5) are not compatible.
EDIT2: I do not really understand this. Shouldn't hidden be W_hidden x n_input.T, since in dimensions this would be (5, 2) x (2, 1)? If I transpose n_input, hidden still works (I don't even understand why it works without a transpose at all). However, output keeps throwing errors, even though this operation in dimensions should be (1, 5) x (5, 1)?!
(0) It's helpful to include the error output - it's also a useful thing to look at, because it does identify exactly where you were having shape problems.
(1) The shape errors arose because you have the arguments to matmul backwards in both of your matmuls, and the tf.Variable shapes transposed. The general rule is that the weights for a layer with input_size inputs and output_size outputs should have shape [input_size, output_size], and the matmul should be tf.matmul(input_to_layer, weights_for_layer) (and then add the biases, which have shape [output_size]).
So with your code,
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0))
should be:
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0))
and
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
should be tf.matmul(n_input, W_hidden); and
output = tf.sigmoid(tf.matmul(W_output, hidden))
should be tf.matmul(hidden, W_output).
(2) Once you've fixed those bugs, your run needs to be fed a feed_dict:
sess.run(train)
should be:
sess.run(train, feed_dict={n_input: input_data})
At least, I presume that this is what you're trying to achieve.
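Putting those fixes together, a sketch of the corrected graph (the n_output placeholder and the reshaped labels are my additions for feeding the desired outputs; I also pass raw logits to sigmoid_cross_entropy_with_logits, since that op applies the sigmoid itself, which the answer does not discuss):
import tensorflow as tf

input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
output_data = [[0.], [1.], [1.], [0.]]  # reshaped to [4, 1] to match the output layer

hidden_nodes = 5
n_input = tf.placeholder(tf.float32, shape=[None, 2])
n_output = tf.placeholder(tf.float32, shape=[None, 1])  # hypothetical label placeholder

W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0))
b_hidden = tf.Variable(0.1)
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)

W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0))
output = tf.matmul(hidden, W_output)  # raw logits; the loss applies the sigmoid

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=n_output))
train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for epoch in xrange(201):
    sess.run(train, feed_dict={n_input: input_data, n_output: output_data})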