Let's say I want to compute the Hessian of a scalar-valued function with respect to some parameters W (e.g the weights and biases of a feed-forward neural network).
If you consider the following code, implementing a two-dimensional linear model trained to minimize a MSE loss:
import numpy as np
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[None, 2]) #inputs
t = tf.placeholder(dtype=tf.float32, shape=[None,]) #labels
W = tf.placeholder(np.eye(2), dtype=tf.float32) #weights
preds = tf.matmul(x, W) #linear model
loss = tf.reduce_mean(tf.square(preds-t), axis=0) #mse loss
params = tf.trainable_variables()
hessian = tf.hessians(loss, params)
you'd expect session.run(tf.hessian,feed_dict={}) to return a 2x2 matrix (equal to W). It turns out that because paramsis a 2x2 tensor, the output is rather a tensor with shape [2, 2, 2, 2]. While I can easily reshape the tensor to obtain the matrix I want, it seems that this operation might be extremely cumbersome when paramsbecomes a list of tensors of varying size (i.e when the model is a deep neural network for instance).
It seems that are two ways around this:
Flatten params to be a 1D tensor called flat_params:
flat_params = tf.concat([tf.reshape(p, [-1]) for p in params])
so that tf.hessians(loss, flat_params) naturally returns a 2x2 matrix. However as noted in Why does Tensorflow Reshape tf.reshape() break the flow of gradients? for tf.gradients (but also holds for tf.hessians), tensorflow is not able to see the symbolic link in the graph between paramsand flat_params and tf.hessians(loss, flat_params) will raise an error as the gradients will be seen as None.
In https://afqueiruga.github.io/tensorflow/2017/12/28/hessian-mnist.html, the author of the code goes the other way, and first create the flat parameter and reshapes its parts into self.params. This trick does work and gets you the hessian with its expected shape (2x2 matrix). However, it seems to me that this will be cumbersome to use when you have a complex model, and impossible to apply if you create your model via built-in functions (like tf.layers.dense, ..).
Is there no straight-forward way to get the Hessian matrix (as in the 2x2 matrix in this example) from tf.hessians, when self.params is a list of tensor of arbitrary shapes? If not, how can you automatize the reshaping of the output tensor of tf.hessians?
It turns out (per TensorFlow r1.13) that if len(xs) > 1, then tf.hessians(ys, xs) returns tensors corresponding to only the block diagonal submatrices of the full Hessian matrix. Full story and solutions in this paper https://arxiv.org/pdf/1905.05559, and code at https://github.com/gknilsen/pyhessian
Related
I pass a list a to my custom function and I want to tf.tile it after converting it to a constant tensor. The times I tile it depends on the shape of y_true. I don't know how I can get the shape of y_true as integers. Here's the code:
def getloss(a):
a = tf.constant(a, tf.float32)
def loss(y_true, y_pred):
a = tf.reshape(a, [1,1,-1])
ytrue_shape = y_true.get_shape().as_list() #????
multiples = tf.constant([ytrue_shape[0], ytrue_shape[1], 1], tf.int32)
a = tf.tile(a, multiples)
#...
return loss
I have tried y_true.get_shape().as_list() but it reports an error because the first dimension (batch size) is None when compiling the model. Is there any way I can use the shape of y_true here?
When trying to access the shape of a tensor during the building of the model, when not all shapes are known, it is best to use tf.shape. It will be evaluated when the model is ran, as stated in the doc :
tf.shape and Tensor.shape should be identical in eager mode. Within tf.function or within a compat.v1 context, not all dimensions may be known until execution time. Hence when defining custom layers and models for graph mode, prefer the dynamic tf.shape(x) over the static x.shape.
ytrue_shape = tf.shape(y_true)
This will yield a Tensor, so use TF ops to get what you want :
multiples = tf.concat((tf.shape(y_true_shape)[:2],[1]),axis=0)
I am trying to do SVD using a neural network. My input is a matrix (let's say 4x4 matrices only) and the output is a vector representing the decomposed form (given that the input is 4x4 this would be a 36 element vector with 16 elements for U, 4 elements for S, and 16 elements for V.T).
I am trying to define a custom loss function instead of using something like MSE on the decomposed form. So instead of comparing the 36 length vectors for loss, I want to compute the loss between the reconstructed matrices. So if A = U * S * V.T (actual) and A' = U' * S' * V.T' (predicted), I want to compute the loss between A and A'.
I am pretty new to tensorflow and keras, so I may be doing some naive things, but here is what I have so far. While the logic seems okay to me, I get a TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn. I am not sure why this is the case and how to fix it? Also, do I need to flatten the output from the reconstruct_matrix, as I am currently doing, or should I just leave it as is?
# This function takes the decomposed matrix (vector of U, S, V.T)
# and reconstructs the original matrix
def reconstruct_matrix(decomposed_vector):
example = decomposed_vector
s = np.zeros((4,4))
for en, i in enumerate(example[16:20]):
s[en, en] = i
u = example[:16].reshape(4,4)
vt = example[20:].reshape(4,4)
orig = np.matmul(u, s)
orig = np.matmul(orig, vt)
return orig.flatten() # Given that matrices are 4x4, this will be a length 16 vector
# Custom loss that essentially computes MSE on reconstructed matrices
def custom_loss(y_true, y_pred):
Y = reconstruct_matrix(y_true)
Y_prime = reconstruct_matrix(y_pred)
return K.mean(K.square(Y - Y_prime))
model.compile(optimizer='adam',
loss=custom_loss)
Note: My keras version is 2.2.4 and my tensorflow version is 1.14.0.
In tf1.x eager execution is disabled by default (it's on for version 2 onwards).
You have to enable it by calling at the top of your script:
import tensorflow as tf
tf.enable_eager_execution()
This mode allows you to use Python-like abstractions for flow control (e.g. if statements and for loop you've been using in your code). If it's disabled you need to use Tensorflow functions (tf.cond and tf.while_loop for if and for respectively).
More information about it in the docs.
BTW. I'm not sure about flatten, but remember your y_true and y_pred need the same shape and samples have to be respective to each other, if that's fulfilled you should be fine.
I am a beginner for Tensorflow. I am a bit confused by the tutorial. The author firstly gives a formula y=softmax(Wx+b), but use xW+b in the python code and explain it is a small trick. I do not understand the trick, why does the author need to flip the formula?
https://www.tensorflow.org/get_started/mnist/beginners
First, we multiply x by W with the expression tf.matmul(x, W). This is
flipped from when we multiplied them in our equation, where we had Wx,
as a small trick to deal with x being a 2D tensor with multiple
inputs. We then add b, and finally apply tf.nn.softmax.
As you can see from the formula,
y=softmax(Wx + b)
the input x is multiplied by the Weight variable W,but in the doc
y = tf.nn.softmax(tf.matmul(x, W) + b)
W is multiplied by x for calculation convenience, so we must flip W from 10*784 to 784*10 keep consistent with the formula.
In general in machine learning, esp. tensorflow, you always want your first dimension to represent your batch. The trick is only a way of ensuring that without transposing everything before and after each matrix multiplication.
x is not really a column vector of features, but a 2D matrix of shape (batch_size, n_features).
If you keep Wx, then you'll transpose x (to x' of shape (n_features, batch_size)) use W of shape (n_outputs, n_features), and Wx' will be of shape (n_outputs, batch_size), so you'll have to transpose it back to (batch_size, n_outputs), which is what you want in the end.
If you're using tf.matmul(x, W), then W is of shape (n_features, n_outputs ), and the result is directly of shape (batch_size, n_outputs).
I agree this is not clear at first.
x being a 2D tensor with multiple inputs
is a very succinct way to tell you that in tensorflow, data is stored in tensors following conventions that are not those of linear algebra.
In particular, the outermost dimension (i.e. columns for matrices) is always the sample dimension: that is, it has the same size as your number of samples.
When you store sample features in a 2D tensor (a matrix), the features are therefore stored in the inner-most dimension, i.e. lines. That is, tensor x is the transposed of variable $x$ in the equation. So are W and b. The fact that x.T*W.T=(W.x).T explains the swap inconsistency in the multiplication between the linear algebra equation and the tensor implementation of it.
I am working on a siamese CNN with attention in TensorFlow.
The CNN structure consists on a embedding lookup table shared by two CNN sharing weights.
The inputs for the network are two matrices, both containing indices for question and answer to be fed into the network (batch_size x sentence_length):
self.input_q = tf.placeholder(tf.int32, [None, sentence_length], name="input_q")
self.input_a = tf.placeholder(tf.int32, [None, sentence_length], name="input_a")
After embedding each sentence (row from the input matrix) I end up with two tensors (questions and answer) each of them of size: batch_size x sentence_lentgh x embedding_size.
Let's forget for now about the batch dimension to make things easier. This is to say, we have two matrices Qemb and Aemb, both sentence_lentgh x embedding_size.
From this two matrices I would like to construct a third one, an attention matrix A used for a posterior learnable attention feature matrix , using numpy would be defined as follows:
A[i,j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i,:]-Aemb[j,:]))
This matrix is built for each input pair, so should be a part of the graph, but apparently this cannot be done in TensorFlow as there's no asingn operation by index for a Tensor.
Am I right?
I thought I could run the ops for embedding the question and answer, build the A matrix outside the graph given the computedembeddings and then feed the A matrix back to the graph to continue the next operations based on it.
self.attention_matrix = \
tf.placeholder(tf.float32,
[None, sentence_length, sentence_length],
name = "Attention_matrix")
Is there any problem with this approach that I might not be aware of?
(Appart from runing the embeddings ops twice, what doesn't seem optimal, but not a big deal)
I'm new to theano and trying to use the examples convolutional network and denoising autoencoder to make a denoising convolutional network. I am currently struggling with how to make W', the reverse weights. In this paper they use tied weights for W' that are flipped in both dimensions.
I'm currently working on a 1d signal, so my image shape is (batch_size, 1, 1, 1000) and filter/W size is (num_kernels, 1, 1, 10) for example. The output of the convolution is then (batch_size, num_kernels, 1, 991).
Since I want to W' to be just the flipped in 2 dimensions (or 1d in my case), I'm tempted to do this
w_value = numpy_rng.uniform(low=-W_bound, high=W_bound, size=filter_shape)
self.W = theano.shared(np.asarray((w_value), dtype=theano.config.floatX), borrow=True)
self.W_prime = T.repeat(self.W[:, :, :, ::-1], num_kernels, axis=1)
where I reverse flip it in the relevant dimension and repeat those weights so that they are the same dimension as the feature maps from the hidden layer.
With this setup, do I only have to get the gradients for W to update or should W_prime also be a part of the grad computation?
When I do it like this, the MSE drops a lot after the first minibatch and then stops changing. Using cross entropy gives NaN from the first iteration. I don't know if that is related to this issue or if it's one of many other potential bugs I have in my code.
I can't comment on the validity of your W_prime approach but I can say that you only need to compute the gradient of the cost with respect to each of the original shared variables. Your W_prime is a symbolic function of W, not a shared variable itself so you don't need to compute gradients with respect to W_prime.
Whenever you get NaNs, the first thing to try is to reduce the size of the learning rate.