While studying neural networks with TensorFlow, I came across a question about tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME').
When I pass in the image x and the weights W (initialized with tf.truncated_normal(shape, stddev=0.1)), I understand that it returns some value, which is the result of tf.nn.conv2d().
But my question is: when tf.nn.conv2d() is called, does it change the weight values?
If it changes the weights, how does that work? When I print W, the values do change, but I don't know why. My assumption is that W is passed by reference, so that computing tf.nn.conv2d() modifies W. Is that right?
The TensorFlow code flow is not like that of a conventional programming language. First, a graph is built from the code (it can be visualized using TensorBoard), and then the update rules are computed using backpropagation, which is implemented internally.
When you write:
h = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
it creates a convolutional layer in your neural network, which performs convolution (http://cs231n.github.io/convolutional-networks/) on your input matrix and outputs the result in h. Now, the whole purpose of performing such convolutions is to identify some local patterns such as vertical or horizontal edges in the image. For example, a weight matrix W such as
W = [[0,1,0],[0,1,0],[0,1,0]]
would identify vertical edges in the image. However, since W is initialized randomly here
W = tf.Variable(tf.truncated_normal(shape, stddev=0.1))
it would not be able to find any pattern at the outset. This is solved through backpropagation.
When you train your neural network on labeled data, at each step the matrix W is updated in the direction that reduces the error E, using the derivative of E with respect to W. You cannot see this happening in your code because backpropagation is implemented internally in TensorFlow and you only need to write code for the forward pass. If you defined W as
W = tf.Variable(tf.truncated_normal(shape, stddev=0.1), trainable=False)
it won't be updated, but then the entire purpose of training the parameters would be defeated.
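To make the distinction concrete, here is a minimal sketch (the shapes, optimizer, and session usage are illustrative assumptions, not taken from the original post): tf.nn.conv2d only adds a node to the graph and never touches W; W changes only when a training op derived from a loss is run.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
W = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
h = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')  # does NOT modify W

# W only changes when a training op built from a loss is executed, e.g.:
# loss = some_function_of(h, labels)
# train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# sess.run(train_op, feed_dict={...})   # this is the step that updates W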
I suggest you go through http://neuralnetworksanddeeplearning.com to understand how neural networks work before you proceed with TensorFlow.
I'm trying to figure out how to backpropagate a GRU Recurrent network, but I'm having trouble understanding the GRU architecture precisely.
The image below shows a GRU cell with 3 neural networks, receiving the concatenated previous hidden state and the input vector as its input.
GRU example
The image I used as a reference for backpropagation, however, shows the inputs being fed into W and U for each of the gates, added together, and then passed through the appropriate activation functions.
GRU Backpropagation
As an example, the equation for the update gate shown on Wikipedia is:
z_t = sigmoid(W_z * x_t + U_z * h_(t-1))
can somebody explain to me what W and U represent?
EDIT:
In most of the sources I found, W and U are usually referred to as "weights", so my best guess is that W and U represent their own neural networks, but this would contradict the image I found earlier.
If somebody could give an example of how W and U would work in a simple GRU, that would be helpful.
Sources for the images:
https://cran.r-project.org/web/packages/rnn/vignettes/GRU_units.html
https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45
W and U are matrices whose values are learnt during training (i.e. they are the neural network's weights). The matrix W multiplies the vector x_t and produces a new vector. Similarly, the matrix U multiplies the vector h_(t-1) and produces a new vector. Those two new vectors are added together and then each component of the result is passed to the sigmoid function.
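If it helps, here is a toy NumPy sketch of just the update gate (the sizes and random values are made up for illustration): W_z and U_z are nothing more than weight matrices that are learnt along with the rest of the parameters, not separate networks.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

hidden_size, input_size = 4, 3
W_z = np.random.randn(hidden_size, input_size)    # multiplies the current input x_t
U_z = np.random.randn(hidden_size, hidden_size)   # multiplies the previous state h_(t-1)

x_t = np.random.randn(input_size)
h_prev = np.zeros(hidden_size)

# Update gate: z_t = sigmoid(W_z x_t + U_z h_(t-1))
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)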
I'm following the code of a Coursera assignment which implements an NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts as an input to the LSTM. However, it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)

    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )

    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)

    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell, cell_bw=backward_cell,
        dtype=tf.float32,
        inputs=embeddings,
        sequence_length=self.lengths
    )
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
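For illustration, the training op probably looks something like the sketch below (the optimizer choice and exact placement are my assumptions, since that part of the assignment isn't shown). Because embedding_matrix_variable is a trainable tf.Variable on the path to self.loss, minimize() computes its gradient and updates it along with the other weights.
# Hypothetical training op (not part of the snippet above):
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)   # optimizer choice is an assumption
self.train_op = optimizer.minimize(self.loss)             # also updates embedding_matrix_variable

# Sanity check: the embedding matrix is registered as a trainable variable
# print([v.name for v in tf.trainable_variables()])       # 'embedding_matrix:0' should appear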
I have an array of 1D input data of shape (30, 1). I'm trying to map it to output data of shape (30, 1) (with noise). I have plotted the data, and the relationship is definitely non-linear and continuous.
I want to train a neural network to reproduce this mapping. I am currently trying to complete this task using tensorflow.
My problem right now is that the output data spans an undefined range (e.g. -2.74230671e+01, 1.00000000e+03, 6.34566772e+02, etc.), while the non-linear activation functions in TensorFlow all seem to be bounded between -1 and 1:
https://www.tensorflow.org/versions/r0.12/api_docs/python/nn/activation_functions_
I am rather new to tensorflow etc, so my question is, how do I approach this problem?
I thought I could mean-normalize the data, but since I don't actually know the range of the output values (they are possibly unbounded), I'm not sure how to do that.
Is this possible using TensorFlow functions, or will I need to build my own? The approach I am using is below, where I tried different activation functions in place of tf.nn.relu:
tf_x = tf.placeholder(tf.float32, x.shape)   # input x
tf_y = tf.placeholder(tf.float32, y.shape)   # output y

# neural network layers
l1 = tf.layers.dense(tf_x, 50, tf.nn.relu)   # tried different activation functions here
output = tf.layers.dense(l1, 1)              # tried here too

loss = tf.losses.mean_squared_error(tf_y, output)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.05)
train_op = optimizer.minimize(loss)

# train
for step in range(30):
    _, l, pred = sess.run([train_op, loss, output], {tf_x: x, tf_y: y})

print(sess.run(loss, feed_dict={tf_x: x, tf_y: y}))
You definitely have to normalize your data for it to work and it does not necessarily have to be in the range [-1, 1].
Take a Computer Vision (CV) problem as an example. What some papers do is simply divide by 255.0. Other papers compute the mean and standard deviation of each RGB channel over all the images. To normalize the images, we simply compute (x - mu) / sigma for each channel.
Since your data is unbounded, as you said, we can't simply divide by a fixed scalar. Perhaps the best approach is to normalize based on the statistics of your data. Specific to your case, you could compute the mean and standard deviation of your output values and use them to standardize the targets (and then invert the transformation on the network's predictions).
This post is more detailed and will potentially help you.
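As a hedged sketch of what that could look like for your targets (x and y here stand for the NumPy arrays from your question): standardize y before training and undo the transform on the predictions.
import numpy as np

y_mean, y_std = y.mean(axis=0), y.std(axis=0)   # statistics of the training targets
y_norm = (y - y_mean) / y_std                   # feed this as tf_y during training

# ... train the network on (x, y_norm) ...

# pred_norm = sess.run(output, {tf_x: x})       # prediction in normalized space
# pred = pred_norm * y_std + y_mean             # back to the original scale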
I have some background in machine learning and python, but I am just learning TensorFlow. I am going through the tutorial on deep convolutional neural nets to teach myself how to use it for image classification. Along the way there is an exercise, which I am having trouble completing.
EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.
The exercise refers to the inference() function in the cifar10.py model. The 2nd to last layer (called local4) has a shape=[384, 192], and the top layer has a shape=[192, NUM_CLASSES], where NUM_CLASSES=10 of course. I think the code that we are asked to edit is somewhere in the code defining the top layer:
with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
But I don't see any code that determines which units are connected between layers, so I don't know how to change the model from fully connected to locally connected. Does anybody know how to do this?
I'm also working on this exercise. I'll try and explain my approach properly, rather than just give the solution. It's worth looking back at the mathematics of a fully connected layer (https://www.tensorflow.org/get_started/mnist/beginners).
So the linear algebra for a fully connected layer is:
y = W * x + b
where x is the n-dimensional input vector, b is an n-dimensional vector of biases, and W is an n-by-n matrix of weights. The i-th element of y is the i-th row of W multiplied element-wise with x and summed, plus b[i].
So... if you only want y[i] connected to x[i-1], x[i], and x[i+1], you simply set all values in the i-th row of W to zero, apart from the (i-1)-th, i-th and (i+1)-th columns of that row. Therefore, to create a locally connected layer, you simply constrain W to be a banded matrix (https://en.wikipedia.org/wiki/Band_matrix), where the width of the band equals the size of the locally connected neighbourhood you want. TensorFlow has a function for extracting the band of a matrix: tf.batch_matrix_band_part(input, num_lower, num_upper, name=None), renamed tf.matrix_band_part (and later tf.linalg.band_part) in newer versions.
This seems to me to be the simplest mathematical solution to the exercise.
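For what it's worth, here is a rough sketch of that idea (not the official solution; the layer width and initializer are illustrative, and the band function is tf.matrix_band_part in TF 1.x):
import tensorflow as tf

n = 192      # layer width, matching the local4 -> softmax layer in cifar10.py
band = 1     # connect y[i] only to x[i-1], x[i], x[i+1]

W_full = tf.Variable(tf.truncated_normal([n, n], stddev=0.04))
W_banded = tf.matrix_band_part(W_full, band, band)   # zeros everything outside the band
b = tf.Variable(tf.zeros([n]))
# y = tf.matmul(x, W_banded) + b                     # x has shape [batch, n]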
I'll try to answer your question, although I'm not 100% sure I got it right either.
Looking at the cuda-convnet architecture we can see that the TensorFlow and cuda-convnet implementations start to differ after the second pooling layer.
The TensorFlow implementation has two fully connected layers followed by a softmax classifier.
cuda-convnet has two locally connected layers, one fully connected layer, and a softmax classifier.
The code snippet you included refers only to the softmax classifier and is in fact shared between the two implementations. To reproduce the cuda-convnet implementation using TensorFlow we have to replace the existing fully connected layers with two locally connected layers and a fully connected one.
Since TensorFlow doesn't provide locally connected layers as part of its API, we have to figure out a way to implement them using the existing tools. Here is my attempt to implement the first locally connected layer:
with tf.variable_scope('local3') as scope:
    shape = pool2.get_shape()
    h = shape[1].value
    w = shape[2].value
    sz_local = 3  # kernel size
    sz_patch = (sz_local**2) * shape[3].value
    n_channels = 64

    # Extract 3x3 tensor patches
    patches = tf.extract_image_patches(pool2, [1, sz_local, sz_local, 1], [1, 1, 1, 1], [1, 1, 1, 1], 'SAME')
    weights = _variable_with_weight_decay('weights', shape=[1, h, w, sz_patch, n_channels], stddev=5e-2, wd=0.0)
    biases = _variable_on_cpu('biases', [h, w, n_channels], tf.constant_initializer(0.1))

    # "Filter" each patch with its own kernel
    mul = tf.multiply(tf.expand_dims(patches, axis=-1), weights)
    ssum = tf.reduce_sum(mul, axis=3)
    pre_activation = tf.add(ssum, biases)
    local3 = tf.nn.relu(pre_activation, name=scope.name)
I'm new to Theano and I'm trying to use the example convolutional network and denoising autoencoder to build a denoising convolutional network. I am currently struggling with how to make W', the reversed weights. In this paper they use tied weights for W' that are flipped in both dimensions.
I'm currently working on a 1d signal, so my image shape is (batch_size, 1, 1, 1000) and filter/W size is (num_kernels, 1, 1, 10) for example. The output of the convolution is then (batch_size, num_kernels, 1, 991).
Since I want W' to simply be W flipped in two dimensions (or one in my case), I'm tempted to do this:
w_value = numpy_rng.uniform(low=-W_bound, high=W_bound, size=filter_shape)
self.W = theano.shared(np.asarray(w_value, dtype=theano.config.floatX), borrow=True)
self.W_prime = T.repeat(self.W[:, :, :, ::-1], num_kernels, axis=1)
where I flip it in the relevant dimension and repeat those weights so that they have the same dimensions as the feature maps from the hidden layer.
With this setup, do I only have to get the gradients for W to update or should W_prime also be a part of the grad computation?
When I do it like this, the MSE drops a lot after the first minibatch and then stops changing. Using cross-entropy gives NaN from the first iteration. I don't know whether that is related to this issue or whether it's one of many other potential bugs in my code.
I can't comment on the validity of your W_prime approach but I can say that you only need to compute the gradient of the cost with respect to each of the original shared variables. Your W_prime is a symbolic function of W, not a shared variable itself so you don't need to compute gradients with respect to W_prime.
Whenever you get NaNs, the first thing to try is to reduce the size of the learning rate.
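To make the first point concrete, here is a hedged Theano sketch (cost, learning_rate and any bias variables are placeholders for whatever your autoencoder defines): W_prime is a symbolic expression built from W, so T.grad follows it automatically, and only the shared variables need explicit updates.
import theano
import theano.tensor as T

gW = T.grad(cost, self.W)                             # gradient flows through W_prime as well
updates = [(self.W, self.W - learning_rate * gW)]     # add biases etc. to the list the same way
# train_fn = theano.function([inputs], cost, updates=updates)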