I'm trying to figure out how to backpropagate a GRU Recurrent network, but I'm having trouble understanding the GRU architecture precisely.
The image below shows a GRU cell with 3 neural networks, receiving the concatenated previous hidden state and the input vector as its input.
GRU example
The image I referenced for backpropagation, however, shows the inputs being forwarded into W and U for each of the gates, added together, and then having the appropriate activation functions applied.
GRU Backpropagation
As an example, the equation for the update gate shown on Wikipedia is:
z_t = sigmoid(W_z * x_t + U_z * h_{t-1})
can somebody explain to me what W and U represent?
EDIT:
In most of the sources I found, W and U are usually referred to as "weights", so my best guess is that W and U each represent their own neural network, but this would contradict the first image I found.
If somebody could give an example of how W and U would work in a simple GRU, that would be helpful.
Sources for the images:
https://cran.r-project.org/web/packages/rnn/vignettes/GRU_units.html
https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45
W and U are matrices whose values are learnt during training (a.k.a. neural network weights). The matrix W multiplies the vector x_t and produces a new vector. Similarly, the matrix U multiplies the vector h_{t-1} and produces a new vector. Those two new vectors are added together, and then each component of the result is passed through the sigmoid function.
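To make this concrete, here is a minimal numpy sketch of just the update gate, with made-up sizes (input size 4, hidden size 3) purely for illustration:
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

input_size, hidden_size = 4, 3

# W_z and U_z are plain weight matrices learnt during training, not separate networks.
W_z = np.random.randn(hidden_size, input_size)   # multiplies the input x_t
U_z = np.random.randn(hidden_size, hidden_size)  # multiplies the previous hidden state h_{t-1}

x_t = np.random.randn(input_size)
h_prev = np.random.randn(hidden_size)

# z_t = sigmoid(W_z x_t + U_z h_{t-1})
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
print(z_t.shape)  # (3,)
Note that multiplying the concatenated vector [h_{t-1}, x_t] by one big matrix (as in the first image) gives exactly the same result as keeping W and U separate and adding the two products, so the two pictures describe the same architecture.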
Let's assume I have a number x.
This x can take finitely many categorical values. By using the Embedding layer of Keras, this number x becomes a vector of the embedding dimension.
Assume:
E embedded vector.
o one-hot encoded vector of x.
W embedding matrix.
Then, to turn x into a vector, we compute E = W·o.
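As a side note, since o is one-hot, the product W·o simply selects one column of W. A tiny numpy sketch, assuming W has shape (embedding_dim, vocab_size) and o is a one-hot column vector:
import numpy as np

vocab_size, embedding_dim = 5, 3
W = np.random.randn(embedding_dim, vocab_size)  # embedding matrix

x = 2                       # the categorical value, used as an index
o = np.zeros(vocab_size)
o[x] = 1.0                  # one-hot encoding of x

E = W @ o                   # same as simply looking up column x
print(np.allclose(E, W[:, x]))  # True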
Now, in order to retrieve E from Keras, we can execute the following code:
K.function([model.layers[0].input], [model.get_layer('embedding').output])([input_data])
While to retrieve W from Keras, we can execute:
model.get_layer('embedding').get_weights()
However, I have no idea how I can access the vector o (since Keras calculates it automatically), i.e. the one-hot encoded version of x. Since I do not have any information about how the one-hot encoding is performed (precisely, which value is mapped to which component), it does not seem trivial to do it another way.
Any answer to the question?
While reading through some code to understand neural networks, I came across this line:
hidden_errors = numpy.dot(self.who.T, output_errors)
# The error of the hidden layer is calculated by recombining the errors of the output layer divided by the weight.
Why is the matrix transposed in this code?
To avoid any copyright violation, I will post the GitHub address of the original code instead of the entire code.
https://github.com/freebz/Make-Your-Own-Neural-Network/blob/master/neural_network.py
The transpose is necessary in order for the dot product (in this case equivalent to matrix multiplication) to be possible and calculate the right thing.
self.who is created in the line:
numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))
which makes it an OxH matrix, where O (rows) is the number of output nodes and H (columns) is the number of hidden nodes. targets is created in the line:
targets = numpy.array(targets_list, ndmin=2).T
which makes it an Ox1 (2D) matrix. output_errors is then the same dimensions as targets and also an Ox1 column vector of O values (as would be expected):
output_errors = targets - final_outputs
In the implemented backpropagation algorithm, the hidden errors are calculated by premultiplying the output errors by the weights connecting the hidden layer to the output layer. From the numpy doc of numpy.dot:
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
So we need to transpose self.who to be an HxO matrix for this to work correctly in multiplying with an Ox1 matrix to give the required Hx1 matrix of hidden errors.
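As a quick sanity check of the shapes, here is a minimal numpy sketch with made-up sizes (O = 3 output nodes, H = 4 hidden nodes):
import numpy as np

O, H = 3, 4                                         # output and hidden layer sizes
who = np.random.normal(0.0, pow(O, -0.5), (O, H))   # OxH weight matrix
output_errors = np.random.rand(O, 1)                # Ox1 column vector

# (H x O) dot (O x 1) -> (H x 1): one error value per hidden node
hidden_errors = np.dot(who.T, output_errors)
print(hidden_errors.shape)  # (4, 1)

# Without the transpose, the shapes (O x H) and (O x 1) do not align:
# np.dot(who, output_errors)  # would raise a ValueError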
I have some background in machine learning and python, but I am just learning TensorFlow. I am going through the tutorial on deep convolutional neural nets to teach myself how to use it for image classification. Along the way there is an exercise, which I am having trouble completing.
EXERCISE: The model architecture in inference() differs slightly from the CIFAR-10 model specified in cuda-convnet. In particular, the top layers of Alex's original model are locally connected and not fully connected. Try editing the architecture to exactly reproduce the locally connected architecture in the top layer.
The exercise refers to the inference() function in the cifar10.py model. The 2nd to last layer (called local4) has a shape=[384, 192], and the top layer has a shape=[192, NUM_CLASSES], where NUM_CLASSES=10 of course. I think the code that we are asked to edit is somewhere in the code defining the top layer:
with tf.variable_scope('softmax_linear') as scope:
  weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                        stddev=1/192.0, wd=0.0)
  biases = _variable_on_cpu('biases', [NUM_CLASSES],
                            tf.constant_initializer(0.0))
  softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
  _activation_summary(softmax_linear)
But I don't see any code that controls how units in consecutive layers are connected, so I don't know how we can change the model from fully connected to locally connected. Does somebody know how to do this?
I'm also working on this exercise. I'll try and explain my approach properly, rather than just give the solution. It's worth looking back at the mathematics of a fully connected layer (https://www.tensorflow.org/get_started/mnist/beginners).
So the linear algebra for a fully connected layer is:
y = W * x + b
where x is the n-dimensional input vector, b is an n-dimensional vector of biases, and W is an n-by-n matrix of weights. The i-th element of y is the sum of the i-th row of W multiplied element-wise with x.
So, if you only want y[i] connected to x[i-1], x[i], and x[i+1], you simply set all values in the i-th row of W to zero, apart from the (i-1)-th, i-th and (i+1)-th columns of that row. Therefore, to create a locally connected layer, you simply force W to be a banded matrix (https://en.wikipedia.org/wiki/Band_matrix), where the size of the band is equal to the size of the locally connected neighbourhoods you want. TensorFlow has a function for restricting a matrix to its band (tf.batch_matrix_band_part(input, num_lower, num_upper, name=None)).
This seems to me to be the simplest mathematical solution to the exercise.
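For illustration, here is a small numpy sketch (not the TensorFlow graph code) of forcing W to be banded, assuming n = 6 and a neighbourhood of one element on either side (band width 1):
import numpy as np

n = 6
band = 1  # y[i] only connects to x[i-1], x[i] and x[i+1]

W = np.random.randn(n, n)
b = np.random.randn(n)
x = np.random.randn(n)

# Zero out everything outside the band: keep only entries with |i - j| <= band
i, j = np.indices((n, n))
W_local = np.where(np.abs(i - j) <= band, W, 0.0)

y = W_local @ x + b  # each y[i] now depends only on its local neighbourhood
In the graph itself you would apply the same masking to the weight variable (e.g. with the band-part op mentioned above) before the matmul, so the weights outside the band never contribute to the output or receive gradient.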
I'll try to answer your question, although I'm not 100% sure I got it right either.
Looking at the cuda-convnet architecture we can see that the TensorFlow and cuda-convnet implementations start to differ after the second pooling layer.
The TensorFlow implementation has two fully connected layers and a softmax classifier.
cuda-convnet has two locally connected layers, one fully connected layer and a softmax classifier.
The code snippet you included refers only to the softmax classifier and is in fact shared between the two implementations. To reproduce the cuda-convnet implementation in TensorFlow, we have to replace the existing fully connected layers with two locally connected layers and a fully connected one.
Since TensorFlow doesn't have locally connected layers as part of the SDK, we have to figure out a way to implement them using the existing tools. Here is my attempt to implement the first locally connected layer:
with tf.variable_scope('local3') as scope:
  shape = pool2.get_shape()
  h = shape[1].value
  w = shape[2].value
  sz_local = 3  # kernel size
  sz_patch = (sz_local**2) * shape[3].value
  n_channels = 64

  # Extract 3x3 tensor patches
  patches = tf.extract_image_patches(pool2, [1, sz_local, sz_local, 1],
                                     [1, 1, 1, 1], [1, 1, 1, 1], 'SAME')
  weights = _variable_with_weight_decay('weights',
                                        shape=[1, h, w, sz_patch, n_channels],
                                        stddev=5e-2, wd=0.0)
  biases = _variable_on_cpu('biases', [h, w, n_channels],
                            tf.constant_initializer(0.1))

  # "Filter" each patch with its own kernel
  mul = tf.multiply(tf.expand_dims(patches, axis=-1), weights)
  ssum = tf.reduce_sum(mul, axis=3)
  pre_activation = tf.add(ssum, biases)
  local3 = tf.nn.relu(pre_activation, name=scope.name)
I am working on a siamese CNN with attention in TensorFlow.
The structure consists of an embedding lookup table shared by two CNNs that share weights.
The inputs for the network are two matrices, both containing indices for question and answer to be fed into the network (batch_size x sentence_length):
self.input_q = tf.placeholder(tf.int32, [None, sentence_length], name="input_q")
self.input_a = tf.placeholder(tf.int32, [None, sentence_length], name="input_a")
After embedding each sentence (a row from the input matrix) I end up with two tensors (question and answer), each of size batch_size x sentence_length x embedding_size.
Let's forget about the batch dimension for now to make things easier. That is to say, we have two matrices, Qemb and Aemb, both of size sentence_length x embedding_size.
From these two matrices I would like to construct a third one, an attention matrix A, to be used later for a learnable attention feature matrix. Using numpy it would be defined as follows:
A[i,j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i,:]-Aemb[j,:]))
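For clarity, the full construction of A in numpy would look something like this (a small sketch with made-up sizes, sentence_length = 4 and embedding_size = 3):
import numpy as np

sentence_length, embedding_size = 4, 3
Qemb = np.random.randn(sentence_length, embedding_size)
Aemb = np.random.randn(sentence_length, embedding_size)

A = np.zeros((sentence_length, sentence_length))
for i in range(sentence_length):
    for j in range(sentence_length):
        # similarity based on the Euclidean distance between the two word embeddings
        A[i, j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i, :] - Aemb[j, :]))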
This matrix is built for each input pair, so it should be part of the graph, but apparently this cannot be done in TensorFlow, as there is no assign-by-index operation for a Tensor.
Am I right?
I thought I could run the ops for embedding the question and answer, build the A matrix outside the graph given the computed embeddings, and then feed the A matrix back into the graph to continue the following operations based on it.
self.attention_matrix = \
    tf.placeholder(tf.float32,
                   [None, sentence_length, sentence_length],
                   name="Attention_matrix")
Is there any problem with this approach that I might not be aware of?
(Apart from running the embedding ops twice, which doesn't seem optimal, but it's not a big deal.)
I'm confused about how to define the weight matrices for an LSTM. As there are 8 weight matrices for the LSTM, I don't know how to initialize those weight matrices for an LSTM in tensorflow.
But then I came across this implementation, which makes complete sense, as it has all 8 weight matrices, but it doesn't use the tensorflow implementation of LSTM. It is consistent with the LSTM equations. But in the tensorflow implementation of LSTM I don't know how to define all those 8 weight matrices as they are defined in the first implementation above.
Could you please help me out?
First things first: if you take a close look, there aren't exactly 8 matrices but 14 parameters in total: 4 * 3 matrices/vectors of W, U (parameter matrices) and b (bias vector), one set each for the input gate, forget gate, cell state and output gate. In addition, there are two more parameters, W and b, for the dense layer.
Now, coming to the actual question, I presume that you would like to know how these matrices are initialized in Tensorflow.
I am answering with respect to Tensorflow v1.2.
Quick answer: use the TF API for LSTM, which has a parameter called initializer to be used for initializing the weight and projection matrices.
By default, the bias vector is initialized to the zero vector and the kernel weights are initialized randomly using a uniform distribution.
Now, to see where W and b are used, you need to dig a little deeper into the code. I will provide a few pointers.
Method call to evaluate the multiplications: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L576
lstm_matrix = _linear([inputs, m_prev], 4 * self._num_units, bias=True)
Actual computation: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L1051
weights = vs.get_variable(
    _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size],
    dtype=dtype,
    initializer=kernel_initializer)
Here, instead of 4 separate multiplications, a single multiplication is performed and then the result is split into 4 parts.
So, in a nutshell, Tensorflow does the initialization automatically for you. In case you don't want to go with the default initializers, you have the flexibility to provide other options.
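For example, a minimal sketch of passing a custom initializer (assuming TF 1.2 and tf.contrib.rnn.LSTMCell; the sizes and the particular initializer are just for illustration):
import tensorflow as tf

# The initializer argument controls how the kernel (the concatenated W and U) is initialized;
# biases still default to zeros.
cell = tf.contrib.rnn.LSTMCell(
    num_units=128,
    initializer=tf.random_uniform_initializer(-0.1, 0.1))

inputs = tf.placeholder(tf.float32, [None, 20, 50])  # batch x time x features
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
Inside the cell this single kernel matrix is later split into the 4 gate blocks, as described above.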