I'm new to Theano and trying to combine the convolutional network and denoising autoencoder examples to build a denoising convolutional network. I am currently struggling with how to construct W', the reverse weights. In this paper they use tied weights for W' that are flipped in both dimensions.
I'm currently working on a 1d signal, so my image shape is (batch_size, 1, 1, 1000) and filter/W size is (num_kernels, 1, 1, 10) for example. The output of the convolution is then (batch_size, num_kernels, 1, 991).
Since I want W' to just be W flipped in two dimensions (or one dimension in my case), I'm tempted to do this:
# randomly initialise W with shape filter_shape = (num_kernels, 1, 1, filter_len)
w_value = numpy_rng.uniform(low=-W_bound, high=W_bound, size=filter_shape)
self.W = theano.shared(np.asarray(w_value, dtype=theano.config.floatX), borrow=True)
# flip W along the last (length) dimension and repeat it so that it matches
# the number of feature maps produced by the hidden layer
self.W_prime = T.repeat(self.W[:, :, :, ::-1], num_kernels, axis=1)
where I flip it in the relevant dimension and repeat those weights so that they have the same dimensions as the feature maps from the hidden layer.
With this setup, do I only have to get the gradients for W to update or should W_prime also be a part of the grad computation?
When I do it like this, the MSE drops a lot after the first minibatch and then stops changing. Using cross entropy gives NaN from the first iteration. I don't know if that is related to this issue or if it's one of many other potential bugs I have in my code.
I can't comment on the validity of your W_prime approach but I can say that you only need to compute the gradient of the cost with respect to each of the original shared variables. Your W_prime is a symbolic function of W, not a shared variable itself so you don't need to compute gradients with respect to W_prime.
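For concreteness, here's a minimal Theano sketch of that point, with an illustrative cost and shapes (not taken from your code): the gradient is taken with respect to the shared variable W only, and the flipped W_prime just rides along as a symbolic expression.

import numpy as np
import theano
import theano.tensor as T

# illustrative filter shape: (num_kernels, 1, 1, filter_len)
w_value = np.random.uniform(-0.1, 0.1, (5, 1, 1, 10))
W = theano.shared(np.asarray(w_value, dtype=theano.config.floatX), borrow=True)

W_prime = W[:, :, :, ::-1]      # symbolic function of W, not a shared variable

cost = T.sum(W_prime ** 2)      # stand-in for the real reconstruction cost
gW = T.grad(cost, W)            # differentiate w.r.t. W only
updates = [(W, W - 0.1 * gW)]   # W_prime needs no update of its own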
Whenever you get NaNs, the first thing to try is to reduce the size of the learning rate.
When computing the loss between y_true and y_pred, the Keras loss functions reduce the dimensionality by one. For example, when training a network on pairs of 64x64 greyscale images with batch size = 8, the shape of y_true and y_pred would be (8, 64, 64). The Keras loss functions will then produce a loss tensor with shape (8, 64), averaging over the last dimension.
I do not get why that would be necessary; all it does is average the loss over the rows of the image. Doesn't the network need the loss to be calculated individually for every output value (and therefore conserve the shape)? As far as I understand it, backpropagation looks at the individual loss of each output value compared to the target, and then updates the previous weights accordingly. How can it do that knowing only the averaged loss of each row rather than every value individually? Here is a code snippet that shows the behaviour I described:
import tensorflow as tf
from keras import backend as K
from keras.losses import mean_absolute_error

y_true = K.random_uniform([8, 64, 64])
y_pred = K.random_uniform([8, 64, 64])
c = mean_absolute_error(y_true, y_pred)
print(K.eval(tf.shape(c)))  # (8, 64)
I wondered the same thing. I believe Keras assumes your data to have the dimensions [batch, W, H, n_classes], so averaging over axis=-1 means averaging the loss over all the different classes. However, in your case you do not have that dimension because you are presumably doing binary classification on a grayscale image, so instead it ends up averaging the loss over the rows/columns. Interestingly enough, the model can still train and even improve in performance like this, which makes me believe that people in similar situations often just train their models without ever noticing.
You can avoid this by adding a dummy axis to your data (see the sketch after the quoted docs below).
This is how I got there:
From: https://keras.io/api/losses/
"(Note ondN-1: all loss functions reduce by 1 dimension, usually axis=-1.) “
Furthermore: "loss class instances feature a reduction constructor argument, which defaults to "sum_over_batch_size" (i.e. average). Allowable values are "sum_over_batch_size", "sum", and "none":
• "sum_over_batch_size" means the loss instance will return the average of the per-sample losses in the batch.
• "sum" means the loss instance will return the sum of the per-sample losses in the batch.
• "none" means the loss instance will return the full array of per-sample losses. "
From https://www.tensorflow.org/api_docs/python/tf/keras/losses/Reduction
"Caution: Verify the shape of the outputs when using Reduction.NONE. The builtin loss functions wrapped by the loss classes reduce one dimension (axis=-1, or axis if specified by loss function). Reduction.NONE just means that no additional reduction is applied by the class wrapper. For categorical losses with an example input shape of [batch, W, H, n_classes] the n_classes dimension is reduced. For pointwise losses you must include a dummy axis so that [batch, W, H, 1] is reduced to [batch, W, H]. Without the dummy axis [batch, W, H] will be incorrectly reduced to [batch, W]."
Let's say I want to compute the Hessian of a scalar-valued function with respect to some parameters W (e.g. the weights and biases of a feed-forward neural network).
If you consider the following code, implementing a two-dimensional linear model trained to minimize an MSE loss:
import numpy as np
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[None, 2]) #inputs
t = tf.placeholder(dtype=tf.float32, shape=[None,]) #labels
W = tf.Variable(np.eye(2), dtype=tf.float32) #weights (trainable)
preds = tf.matmul(x, W) #linear model
loss = tf.reduce_mean(tf.square(preds-t), axis=0) #mse loss
params = tf.trainable_variables()
hessian = tf.hessians(loss, params)
you'd expect session.run(hessian, feed_dict={...}) to return a 2x2 matrix (equal to W). It turns out that because params is a 2x2 tensor, the output is rather a tensor with shape [2, 2, 2, 2]. While I can easily reshape the tensor to obtain the matrix I want, it seems that this operation might be extremely cumbersome when params becomes a list of tensors of varying size (i.e. when the model is a deep neural network, for instance).
It seems that there are two ways around this:
Flatten params to be a 1D tensor called flat_params:
flat_params = tf.concat([tf.reshape(p, [-1]) for p in params], axis=0)
so that tf.hessians(loss, flat_params) naturally returns a 2x2 matrix. However, as noted in "Why does Tensorflow Reshape tf.reshape() break the flow of gradients?" for tf.gradients (which also holds for tf.hessians), TensorFlow is not able to see the symbolic link in the graph between params and flat_params, and tf.hessians(loss, flat_params) will raise an error because the gradients are seen as None.
In https://afqueiruga.github.io/tensorflow/2017/12/28/hessian-mnist.html, the author of the code goes the other way: he first creates the flat parameter vector and reshapes parts of it into self.params. This trick does work and gets you the Hessian with its expected shape (a 2x2 matrix). However, it seems to me that this will be cumbersome to use when you have a complex model, and impossible to apply if you create your model via built-in functions (like tf.layers.dense, ...).
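For reference, here is a minimal sketch of that "flatten first, then reshape" trick on the toy model above (TF1-style; the shapes are illustrative): because the graph flows from the single flat variable to the loss, tf.hessians can differentiate through the reshape and returns one square Hessian.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 2])
t = tf.placeholder(tf.float32, shape=[None, 2])

# one flat variable holding all parameters...
flat_params = tf.Variable(np.eye(2).ravel(), dtype=tf.float32)
# ...reshaped back into the shape the model expects
W = tf.reshape(flat_params, [2, 2])

preds = tf.matmul(x, W)
loss = tf.reduce_mean(tf.square(preds - t))

hessian = tf.hessians(loss, flat_params)[0]   # shape (4, 4): the full Hessian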
Is there no straightforward way to get the Hessian matrix (as in the 2x2 matrix in this example) from tf.hessians when self.params is a list of tensors of arbitrary shapes? If not, how can you automate the reshaping of the output tensor of tf.hessians?
It turns out (as of TensorFlow r1.13) that if len(xs) > 1, then tf.hessians(ys, xs) returns tensors corresponding only to the block-diagonal submatrices of the full Hessian matrix. The full story and solutions are in this paper https://arxiv.org/pdf/1905.05559, and code is at https://github.com/gknilsen/pyhessian
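A small sketch of that block-diagonal behaviour on two toy variables (shapes chosen just for illustration):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
W = tf.Variable(np.eye(2), dtype=tf.float32)
b = tf.Variable(np.zeros(2), dtype=tf.float32)

loss = tf.reduce_mean(tf.square(tf.matmul(x, W) + b))

blocks = tf.hessians(loss, [W, b])
# blocks[0] has shape (2, 2, 2, 2): second derivatives w.r.t. W only
# blocks[1] has shape (2, 2):       second derivatives w.r.t. b only
# the mixed d2(loss)/dW db terms are not returned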
I have an array of 1D input data of shape (30, 1). I'm trying to map this to output data of shape (30, 1) (with noise). I have plotted the data and it is definitely non-linear and continuous.
I want to train a neural network to reproduce this mapping. I am currently trying to complete this task using tensorflow.
My problem right now is that the output data is in an undefined range (e.g. -2.74230671e+01, 1.00000000e+03, 6.34566772e+02, etc.), and the non-linear TensorFlow activation functions all seem to be bounded between -1 and 1?
https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/activation_functions_
I am rather new to tensorflow etc, so my question is, how do I approach this problem?
I thought I could mean-normalize the data, but I don't actually know the range of the output values (they are possibly unbounded).
Is this possible using TensorFlow functions or will I need to build my own? The approach I am using is below, where I tried different functions in place of tf.nn.relu:
# x, y: the (30, 1) input and output arrays
tf_x = tf.placeholder(tf.float32, x.shape)   # input x
tf_y = tf.placeholder(tf.float32, y.shape)   # output y

# neural network layers
l1 = tf.layers.dense(tf_x, 50, tf.nn.relu)   # tried different activation functions here
output = tf.layers.dense(l1, 1)              # tried here too

loss = tf.losses.mean_squared_error(tf_y, output)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)
train_op = optimizer.minimize(loss)

# train
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(30):
    _, l, pred = sess.run([train_op, loss, output], {tf_x: x, tf_y: y})
    print(sess.run(loss, feed_dict={tf_x: x, tf_y: y}))
You definitely have to normalize your data for it to work and it does not necessarily have to be in the range [-1, 1].
Take a computer vision (CV) problem as an example. Some papers simply divide by 255.0. Other papers compute the mean and standard deviation of each RGB channel over all the images; to normalize the images, they simply do (x - mu) / sigma per channel.
Since your data is unbounded, as you said, we can't simply divide by a fixed scalar. Perhaps the best approach is to normalize based on the data statistics. Specific to your case, you could compute the mean and standard deviation over your 30 data points and use those to standardize.
This post is more detailed and will potentially help you.
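A minimal sketch of that idea (the numbers are illustrative, not your data): standardize the unbounded targets with their own statistics before training, and map the network's predictions back to the original scale afterwards.

import numpy as np

y = np.array([-27.4, 1000.0, 634.6]).reshape(-1, 1)   # raw, unbounded targets
y_mean, y_std = y.mean(), y.std()

y_norm = (y - y_mean) / y_std       # train the network on these values

# after training, undo the scaling on the network's predictions:
pred_norm = y_norm                  # stand-in for the model's output
pred = pred_norm * y_std + y_mean   # back on the original scale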
While studying neural networks with TensorFlow, I have a question about tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME').
When I pass in an image value x and a weight value W (initialized by tf.truncated_normal(shape, stddev=0.1)), I understand it will return some value, which is the result of tf.nn.conv2d().
But my question is: when tf.nn.conv2d() is called, does it change the weight value?
If it changes the value of the weights, how does it work? In fact, when I print the weight value, it changes, but I don't know why... My assumption is that W is passed by reference of some kind, so that while computing tf.nn.conv2d() the value of W is changed. Is that right?
The TensorFlow code flow is not like a conventional programming language. First, a graph is created from the code (which can be visualized using TensorBoard), and then update rules are computed using backpropagation, which is implemented internally.
When you write:
h = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
it creates a convolutional layer in your neural network, which performs convolution (http://cs231n.github.io/convolutional-networks/) on your input matrix and outputs the result in h. Now, the whole purpose of performing such convolutions is to identify some local patterns such as vertical or horizontal edges in the image. For example, a weight matrix W such as
W = [[0,1,0],[0,1,0],[0,1,0]]
would identify vertical edges in the image. However, since W has been initialized randomly here
W = tf.Variable(tf.truncated_normal(shape, stddev=0.1))
it would not be able to find any pattern at the outset. This is solved through backpropagation.
When you train your neural network on labeled data, at each step the matrix W gets updated by moving it against the derivative of the error E w.r.t. W, so that E is reduced. You cannot see it happening in your code because backpropagation is implemented internally in TensorFlow and you only need to write code for the forward pass. If you defined W as
W = tf.Variable(tf.truncated_normal(shape, stddev=0.1), trainable=False)
it won't be updated, but then the entire purpose of training the parameters would be defeated.
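To see this concretely, here is a minimal TF1-style sketch (shapes and the toy objective are illustrative): running the convolution alone leaves W untouched, while running an optimizer step changes it.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 8, 8, 1])
W = tf.Variable(tf.truncated_normal([3, 3, 1, 4], stddev=0.1))
h = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

loss = tf.reduce_mean(tf.square(h))                      # toy objective
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(2, 8, 8, 1).astype(np.float32)

    w0 = sess.run(W)
    sess.run(h, {x: data})                  # forward pass only
    print(np.allclose(w0, sess.run(W)))     # True: conv2d did not touch W

    sess.run(train_op, {x: data})           # one backprop step
    print(np.allclose(w0, sess.run(W)))     # False: the optimizer updated W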
I suggest you go through http://neuralnetworksanddeeplearning.com to understand how neural networks work before you proceed with TensorFlow.
I have some background in machine learning and python, but I am just learning TensorFlow. I am going through the tutorial on deep convolutional neural nets to teach myself how to use it for image classification. Along the way there is an exercise, which I am having trouble completing.
EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.
The exercise refers to the inference() function in the cifar10.py model. The 2nd to last layer (called local4) has a shape=[384, 192], and the top layer has a shape=[192, NUM_CLASSES], where NUM_CLASSES=10 of course. I think the code that we are asked to edit is somewhere in the code defining the top layer:
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
But I don't see any code that determines the probability of connecting between layers, so I don't know how we can change the model from fully connected to locally connected. Does somebody know how to do this?
I'm also working on this exercise. I'll try and explain my approach properly, rather than just give the solution. It's worth looking back at the mathematics of a fully connected layer (https://www.tensorflow.org/get_started/mnist/beginners).
So the linear algebra for a fully connected layer is:
y = W * x + b
where x is the n-dimensional input vector, b is an n-dimensional vector of biases, and W is an n-by-n matrix of weights. The i-th element of y is the sum of the i-th row of W multiplied element-wise with x.
So, if you only want y[i] connected to x[i-1], x[i], and x[i+1], you simply set all values in the i-th row of W to zero, apart from the (i-1)-th, i-th and (i+1)-th columns of that row. Therefore, to create a locally connected layer, you simply constrain W to be a band matrix (https://en.wikipedia.org/wiki/Band_matrix), where the width of the band is equal to the size of the locally connected neighbourhoods you want. TensorFlow has a function for extracting the banded part of a matrix (tf.batch_matrix_band_part(input, num_lower, num_upper, name=None)).
This seems to me to be the simplest mathematical solution to the exercise.
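Here is a minimal sketch of that banded-weight idea, assuming a TensorFlow version where the op is named tf.matrix_band_part (the later name for tf.batch_matrix_band_part); the sizes are illustrative. Entries outside the band receive zero gradient, so only the local connections are ever trained.

import tensorflow as tf

n = 192                                       # layer width, as in local4
x = tf.placeholder(tf.float32, [None, n])
W = tf.Variable(tf.truncated_normal([n, n], stddev=0.04))
b = tf.Variable(tf.zeros([n]))

# keep one sub- and one super-diagonal, so y[i] only sees x[i-1], x[i], x[i+1]
W_banded = tf.matrix_band_part(W, 1, 1)
y = tf.matmul(x, W_banded) + b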
I'll try to answer your question, although I'm not 100% sure I got it right either.
Looking at the cuda-convnet architecture we can see that the TensorFlow and cuda-convnet implementations start to differ after the second pooling layer.
The TensorFlow implementation uses two fully connected layers followed by a softmax classifier.
cuda-convnet uses two locally connected layers, one fully connected layer, and a softmax classifier.
The code snippet you included refers only to the softmax classifier and is in fact shared between the two implementations. To reproduce the cuda-convnet implementation using TensorFlow we have to replace the existing fully connected layers with two locally connected layers and a fully connected one.
Since TensorFlow doesn't have locally connected layers as part of the SDK, we have to figure out a way to implement them using the existing tools. Here is my attempt to implement the first locally connected layer:
with tf.variable_scope('local3') as scope:
    shape = pool2.get_shape()
    h = shape[1].value
    w = shape[2].value
    sz_local = 3                               # kernel size
    sz_patch = (sz_local**2) * shape[3].value
    n_channels = 64

    # Extract 3x3 tensor patches
    patches = tf.extract_image_patches(pool2, [1, sz_local, sz_local, 1],
                                       [1, 1, 1, 1], [1, 1, 1, 1], 'SAME')
    weights = _variable_with_weight_decay('weights',
                                          shape=[1, h, w, sz_patch, n_channels],
                                          stddev=5e-2, wd=0.0)
    biases = _variable_on_cpu('biases', [h, w, n_channels],
                              tf.constant_initializer(0.1))

    # "Filter" each patch with its own kernel
    mul = tf.multiply(tf.expand_dims(patches, axis=-1), weights)
    ssum = tf.reduce_sum(mul, axis=3)
    pre_activation = tf.add(ssum, biases)
    local3 = tf.nn.relu(pre_activation, name=scope.name)