Why tensors not connected to gradients in TensorBoard?

For practice, I wanted to implement a model in TensorFlow that returns the square of its input. My code works correctly, but when I look at the computation graph in TensorBoard, the LOSS operation is connected neither to the gradients subgraph nor to Adam. Why is this? As I understand it, to compute the gradients, TensorFlow has to differentiate the loss.
Here is my code:
import numpy as np
import tensorflow as tf
np_inp = np.array([3, 6, 4, 2, 9, 11, 0.48, 22, -2.3, -0.48])
np_outp = np.power(np_inp, 2)
inputs = tf.Variable(np_inp, name='input', trainable=False)
outputs = tf.Variable(np_outp, name='output', trainable=False)
multiplier = tf.Variable(0.1, dtype=tf.float64, trainable=True, name='multiplier')
mul = inputs * multiplier
predict = tf.square(mul, name='prediction')
loss = tf.math.reduce_sum(tf.math.square(predict-outputs), name='LOSS')
optimizer = tf.train.AdamOptimizer(0.1)
to_minimize = optimizer.minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
logs_path = "./logs/unt" # path to the folder that we want to save the logs for Tensorboard
train_writer = tf.summary.FileWriter(logs_path, sess.graph)
for i in range(100):
    sess.run(to_minimize)
print(sess.run({'mult':multiplier}))
Tensorboard:
https://gofile.io/?c=jxbWiG
Thanks in advance!

This can be counterintuitive, but the actual value of the loss is not used for the training itself (although it can be useful to plot it to monitor progress). What optimizers generally use is the gradient, that is, how a change in each variable would affect the loss value. To compute this, a tensor with the same shape as LOSS but filled with ones is created, and the gradient of each operation is computed from it through back-propagation. If you open the gradients box in the graph, you will see a LOSS_grad box representing this.
It consists of a couple of nodes that build that tensor of ones, because the gradient of something with respect to itself is always one. From there, the rest of the gradients are computed.
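To make the "tensor of ones" idea concrete, here is a minimal sketch in plain Python (no TensorFlow) of the backward pass for the model in the question. The function name loss_and_grad and the shortened input list are my own assumptions for illustration; this is not how TensorFlow implements it internally, only the same chain rule written out by hand.

```python
# Toy reverse-mode differentiation for loss = sum(((x * m) ** 2 - y) ** 2).
# Plain Python, no TensorFlow; all names here are illustrative.

def loss_and_grad(xs, ys, m):
    # Forward pass: keep the intermediate values that the backward pass needs.
    muls = [x * m for x in xs]
    preds = [u * u for u in muls]
    diffs = [p - y for p, y in zip(preds, ys)]
    loss = sum(d * d for d in diffs)

    # Backward pass: the seed is d(loss)/d(loss) = 1.0, not the loss value.
    seed = 1.0
    grad_m = 0.0
    for x, u, d in zip(xs, muls, diffs):
        g_diff = seed * 2.0 * d      # through the square and the sum
        g_pred = g_diff              # through (pred - y)
        g_mul = g_pred * 2.0 * u     # through square(mul)
        grad_m += g_mul * x          # through x * m
    return loss, grad_m


xs = [3.0, 6.0, 4.0, 2.0]
ys = [x * x for x in xs]
loss, grad = loss_and_grad(xs, ys, 0.1)
print(loss, grad)
```

Note that the variable loss is returned only for reporting. The backward pass starts from the constant 1.0 and the stored intermediates, which matches the graph having no edge from the LOSS output into the gradient nodes.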

Related

The logits in the loss in tensorflow can be a placeholder

I use TensorFlow to implement handwritten digit recognition. I would like the logits in softmax_cross_entropy_with_logits to be a placeholder first, and then feed the computed value into that placeholder at run time, but TensorFlow reports the error ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients. I know that it works if I pass outputs directly as the logits, but what if I have to use a placeholder for the logits first? How should I solve this?
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/home/as/downloads/resnet-152_mnist-master/mnist_dataset", one_hot=True)
from tensorflow.contrib.layers import fully_connected
x = tf.placeholder(dtype=tf.float32,shape=[None,784])
y = tf.placeholder(dtype=tf.float32,shape=[None,10])
hidden1 = fully_connected(x,100,activation_fn=tf.nn.elu,
                          weights_initializer=tf.random_normal_initializer())
hidden2 = fully_connected(hidden1,200,activation_fn=tf.nn.elu,
                          weights_initializer=tf.random_normal_initializer())
hidden3 = fully_connected(hidden2,200,activation_fn=tf.nn.elu,
                          weights_initializer=tf.random_normal_initializer())
outputs = fully_connected(hidden3,10,activation_fn=None,
                          weights_initializer=tf.random_normal_initializer())
a = tf.placeholder(tf.float32,[None,10])
loss = tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=a)
reduce_mean_loss = tf.reduce_mean(loss)
equal_result = tf.equal(tf.argmax(outputs,1),tf.argmax(y,1))
cast_result = tf.cast(equal_result,dtype=tf.float32)
accuracy = tf.reduce_mean(cast_result)
train_op = tf.train.AdamOptimizer(0.001).minimize(reduce_mean_loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(30000):
        xs,ys = mnist.train.next_batch(128)
        result = outputs.eval(feed_dict={x:xs})
        sess.run(train_op,feed_dict={a:result,y:ys})
        print(i)

To be brief, the logits in your loss can't be a placeholder; they need to be the output of a TensorFlow operation that depends on your variables. Otherwise, your optimizer can't calculate the gradient w.r.t. any variables (see the error message).
Operations are "a graph node that performs computation on tensors", whereas a placeholder is a tensor that has to be fed when the graph is evaluated.
I don't really understand why you don't pass the outputs operation directly as the logits, like so:
loss = tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=outputs)
I could try to help further if you describe your specific use case.
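To illustrate why the placeholder version finds no gradients, here is a rough sketch in plain Python of a dependency graph. The Node class and trainable_ancestors function are hypothetical helpers I made up for this example, not TensorFlow APIs; they only mimic the idea that an optimizer needs a path from the loss back to a trainable variable.

```python
# Toy dependency graph: each node lists the nodes it was computed from.
# This mimics, in a very simplified way, how an optimizer looks for a path
# from the loss back to the trainable variables. All names are illustrative.

class Node:
    def __init__(self, name, inputs=(), trainable=False):
        self.name = name
        self.inputs = list(inputs)
        self.trainable = trainable


def trainable_ancestors(node):
    """Return the names of trainable nodes reachable from `node`."""
    found, stack, seen = set(), [node], set()
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        if current.trainable:
            found.add(current.name)
        stack.extend(current.inputs)
    return found


x = Node("x")                                  # fed input
w = Node("weights", trainable=True)
outputs = Node("outputs", inputs=[x, w])       # network output
y = Node("y")                                  # fed labels
a = Node("a")                                  # fed "logits" placeholder

loss_from_placeholder = Node("loss", inputs=[y, a])
loss_from_outputs = Node("loss", inputs=[y, outputs])

print(trainable_ancestors(loss_from_placeholder))  # set()
print(trainable_ancestors(loss_from_outputs))      # {'weights'}
```

Feeding the evaluated outputs into a at run time only copies numbers; it does not add an edge to the graph, so the first loss still has no trainable ancestors.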

How is embedding matrix being trained in this code snippet?

I'm following the code of a coursera assignment which implements a NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts as an input to the LSTM. However, it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)
    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell, cell_bw=backward_cell,
        dtype=tf.float32,
        inputs=embeddings,
        sequence_length=self.lengths
    )
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))

In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
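As a rough illustration of the path from the loss back to the embedding matrix, here is a small sketch in plain Python. The loss (sum of squares of the looked-up rows), the plain SGD step, and all function names are assumptions chosen to keep the example short; the assignment's model and optimizer are more involved, but the principle is the same: the rows that took part in the lookup receive a gradient and get updated.

```python
# Toy embedding update in plain Python. The loss is the sum of squares of the
# looked-up rows; names and the loss are illustrative, not the assignment's code.

def embedding_lookup(matrix, ids):
    return [matrix[i] for i in ids]


def toy_loss(matrix, ids):
    return sum(v * v for row in embedding_lookup(matrix, ids) for v in row)


def grad_wrt_matrix(matrix, ids):
    # Each lookup of row i adds 2 * v to the gradient of that row;
    # rows that were never looked up keep a zero gradient.
    grad = [[0.0] * len(row) for row in matrix]
    for i in ids:
        for j, v in enumerate(matrix[i]):
            grad[i][j] += 2.0 * v
    return grad


def sgd_step(matrix, grad, lr):
    return [[v - lr * g for v, g in zip(row, grow)]
            for row, grow in zip(matrix, grad)]


embedding = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch_ids = [0, 2]                      # token 1 does not appear in the batch
grad = grad_wrt_matrix(embedding, batch_ids)
updated = sgd_step(embedding, grad, lr=0.1)
print(updated)
```

Nothing in this sketch assigns to the embedding rows by hand; the update falls out of the gradient, which is what the optimizer's minimize call does for embedding_matrix_variable.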

XOR neural network, the losses don't go down

I'm using MXNet to train an XOR neural network, but the losses don't go down; they are always above 0.5.
Below is my code, using MXNet 1.1.0, Python 3.6, and OS X El Capitan 10.11.6.
I tried two loss functions, squared loss and softmax loss; neither worked.
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
import matplotlib.pyplot as plt
X = nd.array([[0,0],[0,1],[1,0],[1,1]])
y = nd.array([0,1,1,0])
batch_size = 1
dataset = gluon.data.ArrayDataset(X, y)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))
    net.add(gluon.nn.Dense(1, activation="tanh"))
net.initialize()
softmax_cross_entropy = gluon.loss.SigmoidBCELoss()#SigmoidBinaryCrossEntropyLoss()
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})
train_losses = []
for epoch in range(100):
    train_loss = 0
    for data, label in data_iter:
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
        train_loss += nd.mean(loss).asscalar()
    train_losses.append(train_loss)
plt.plot(train_losses)
plt.show()

I got this question figured out elsewhere, so I'm going to post the answer here.
Basically, my original code had two separate problems.
1. Weight initialization. Notice that I used the default initialization
net.initialize()
which actually does
net.initialize(initializer.Uniform(scale=0.07))
Apparently these initial weights were too small, and the network could never get away from them. So the fix is
net.initialize(mx.init.Uniform(1))
After doing this, the network could converge using sigmoid or tanh as the activation and L2Loss as the loss function. It also worked with sigmoid and SigmoidBCELoss. However, it still didn't work with tanh and SigmoidBCELoss, which is fixed by the second item below.
2. SigmoidBCELoss has to be paired with the output layer in one of these two ways:
2.1. Linear activation and SigmoidBCELoss(from_sigmoid=False);
2.2. Non-linear activation and SigmoidBCELoss(from_sigmoid=True), where the output of the non-linear function falls into (0, 1).
In my original code, when I used SigmoidBCELoss, I was using either all sigmoid or all tanh. So I just needed to change the activation in the output layer from tanh to sigmoid, and the network could converge. I can still have tanh in the hidden layers.
Hope this helps!
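To show why the two pairings in item 2 agree with each other, and why a tanh output does not fit, here is a small sketch in plain Python using the usual binary cross-entropy formulas. The function names are mine, and I am assuming these formulas correspond to what the from_sigmoid flag selects; treat it as an illustration rather than the library's exact code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logits(z, label):
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
    # Corresponds to a linear output layer (the from_sigmoid=False pairing).
    return max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))


def bce_from_probability(p, label, eps=1e-12):
    # Expects p strictly inside (0, 1); eps only guards against log(0).
    # Corresponds to a sigmoid output layer (the from_sigmoid=True pairing).
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


z = 0.7
loss_a = bce_from_logits(z, 1.0)
loss_b = bce_from_probability(sigmoid(z), 1.0)
print(loss_a, loss_b)  # the two values agree up to rounding

# A tanh output lies in (-1, 1) and can be negative, so it is not a valid
# probability for the second formula.
tanh_out = math.tanh(-0.7)
print(tanh_out)
```

With a negative tanh_out, the log in bce_from_probability is no longer defined, which is one way to see why tanh in the output layer does not fit this loss.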

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10^-1 to 10^-4 which was reduced in log scale at each epoch.
This paper uses the wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). The Xavier method suggests initializing the variables from a normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters))
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain in detail how to construct the loss function. I am unable to find an equivalent TensorFlow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to stochastic gradient descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
regularizers = 0.0  # initialize the accumulator before summing the L2 terms
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)

Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you).
Q2: Is it possible to set the learning rate in log scale?
You can, but do you need it? This is not the first thing that you need to solve in this network. Please see Q3 below.
However, just for reference, use the following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
First, your post does not show the conversion from "pred" to "image" (based on the paper, you need to apply subtraction and the IDWT to obtain the final image).
There is one problem here: the logits have to be calculated based on your label data, i.e. if you use the labeled data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This gives you an output that can be compared against "Y : Label".
If the labeled images in your dataset are the denoised ones, you need to follow this one instead:
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use them as the result from which the rest is calculated. Instead of matmul, you can add a conv2d layer here, without batch normalization or an activation function, and set the output feature count to 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is the L1 distance and it may suffer from some issues. Consider the following situation.
Let's say you have the following array as output, [10, 10, 10, 0, 0], and you are trying to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfitting.
For the same case, consider the output [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4). However, if you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case that we want.
If you use the L2 loss, for the first case you will have 10^2 + 10^2 = 200 as the loss. For the second case, you will get 4^2 * 5 = 80.
Hence, the optimizer will try to move away from the first case as quickly as possible, aiming for overall success rather than perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can look at the cross-entropy loss function (it applies softmax internally, so do not apply softmax twice). Note that it has to be called with named arguments.
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=image, logits=pred))
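The L1 versus L2 arithmetic from the two example outputs in this answer can be checked with a few lines of plain Python. The helper names l1_loss and l2_loss are mine, and they compute the plain sums used in the text (note that tf.nn.l2_loss additionally halves the sum of squares, so its values would be 100 and 40).

```python
target = [10, 10, 10, 10, 10]
case_1 = [10, 10, 10, 0, 0]   # three outputs perfect, two completely wrong
case_2 = [6, 6, 6, 6, 6]      # every output somewhat off


def l1_loss(output, target):
    # Sum of absolute differences.
    return sum(abs(o - t) for o, t in zip(output, target))


def l2_loss(output, target):
    # Sum of squared differences.
    return sum((o - t) ** 2 for o, t in zip(output, target))


print(l1_loss(case_1, target), l1_loss(case_2, target))  # 20 20
print(l2_loss(case_1, target), l2_loss(case_2, target))  # 200 80
```

The L1 loss cannot tell the two cases apart, while the squared loss penalizes the two large errors in the first case much more heavily.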

Q1. How should I initialize the variables? Is the code below correct?
That's correct (although it is missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
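For Q2, one way to read "reduced in log scale at each epoch" is a schedule whose values are evenly spaced on a log axis between the two endpoints. Here is a minimal sketch in plain Python; the function name and the choice of four epochs are my assumptions, since the paper excerpt does not state the number of epochs.

```python
def log_spaced_learning_rates(lr_start, lr_end, num_epochs):
    """Learning rates evenly spaced on a log scale, one per epoch."""
    if num_epochs == 1:
        return [lr_start]
    # Constant ratio between consecutive epochs.
    ratio = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * ratio ** epoch for epoch in range(num_epochs)]


# Hypothetical setting: 4 epochs, going from 1e-1 down to 1e-4.
rates = log_spaced_learning_rates(1e-1, 1e-4, num_epochs=4)
print(rates)  # approximately [0.1, 0.01, 0.001, 0.0001]
```

A constant per-epoch ratio like this is the same shape as tf.train.exponential_decay with staircase=True, with decay_steps set to the number of steps per epoch and decay_rate set to that ratio. Alternatively, the list can be fed epoch by epoch through a learning-rate placeholder.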

The other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate on a log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print(y.shape)
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print(pred.shape)
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
    for i in range(100000):
        if i > 0 and i % N == 0:
            # epoch done, halve the learning rate
            learning_rate /= 2
            print("Epoch completed. LR =", learning_rate)
        # integer division so that idx can be used as a slice index
        idx = i//batch_size + i%batch_size
        f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
        _, err = sess.run([train_op, mse], feed_dict = f)
        acc_err += err
        if i%5000 == 0:
            print("Average error = {}".format(acc_err/5000))
            acc_err = 0.0
    do_train = False  # stop after one pass; otherwise the loop never ends

Train a tensorflow model minimizing the loss of several batches

I would like to train the weights of a model based on the sum of the loss values of several batches. However, it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try to use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that would help minimize the loss. Here's an example TensorFlow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])
loss_op = some_loss_function(weights, feature_values, labels)
with tf.Session() as sess:
    for batch in batches:
        feed_dict = fill_feature_values_and_labels(batch)
        #Calculates loss for one batch
        loss = sess.run(loss_op, feed_dict=feed_dict)
        #Adds it to total loss
        total_loss += loss
# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)
with tf.Session() as sess:
    for step in xrange(num_steps):
        sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around the problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that the total_loss is a function of the weights?

Whatever you optimize in any of the trainers must be part of the graph. Here, what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches, e.g.
input = tf.placeholder("float", (number_of_batches, batch_size, input_size))
Then have your target also be a 3D tensor which can be trained on.
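If stacking all batches into one tensor is not practical, a different technique (not the one proposed in this answer) is gradient accumulation: since the gradient of a sum of losses is the sum of the per-batch gradients, you can add up the gradients batch by batch and apply a single update. Here is a minimal sketch in plain Python with a one-weight linear model; the model, data and function names are made up for illustration.

```python
# Toy model: prediction = w * x, loss for one batch = sum((w * x - y) ** 2).

def batch_grad(w, batch):
    # Analytic gradient of the batch loss with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def total_loss(w, batches):
    return sum((w * x - y) ** 2 for batch in batches for x, y in batch)


batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0), (5.0, 10.0)]]
w = 1.0

# Accumulate the gradients instead of accumulating the loss values.
accumulated = 0.0
for batch in batches:
    accumulated += batch_grad(w, batch)

# One gradient-descent update using the gradient of the total loss.
w_new = w - 0.01 * accumulated
print(accumulated, w_new)
```

The accumulated value equals the derivative of total_loss with respect to w, so the single update is the same one an optimizer would make if total_loss were a node in the graph.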
