Why transpose a matrix here? (Neural network related)

While reading some code to understand neural networks, I came across this line:
hidden_errors = numpy.dot(self.who.T, output_errors)
# The hidden layer error is the output layer error, split by the weights and recombined at the hidden nodes.
Why is the matrix transposed in this code?
To avoid any copyright issue, I am linking to the original code on GitHub instead of posting it in full:
https://github.com/freebz/Make-Your-Own-Neural-Network/blob/master/neural_network.py

The transpose is necessary in order for the dot product (in this case equivalent to matrix multiplication) to be possible and calculate the right thing.
self.who is created in the line:
numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))
which makes it an OxH matrix, where O (rows) is the number of output nodes and H (columns) is the number of hidden nodes. targets is created in the line:
targets = numpy.array(targets_list, ndmin=2).T
which makes it an Ox1 (2D) matrix. output_errors is then the same dimensions as targets and also an Ox1 column vector of O values (as would be expected):
output_errors = targets - final_outputs
In the implemented backpropagation algorithm, the hidden errors are calculated by premultiplying the output errors by the weights connecting the hidden layer to the output layer. From the numpy doc of numpy.dot:
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
So we need to transpose self.who to be an HxO matrix for this to work correctly in multiplying with an Ox1 matrix to give the required Hx1 matrix of hidden errors.
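The shape argument above can be checked with a minimal sketch. The layer sizes (3 output nodes, 4 hidden nodes) and the error values are made up for illustration and do not come from the linked code.

```python
import numpy as np

onodes, hnodes = 3, 4  # hypothetical layer sizes, chosen only for illustration

# Same layout as self.who: one row per output node, one column per hidden node.
who = np.random.normal(0.0, pow(onodes, -0.5), (onodes, hnodes))

# Made-up output errors as an Ox1 column vector, built the same way as targets.
output_errors = np.array([0.1, -0.2, 0.3], ndmin=2).T

# (H x O) times (O x 1) gives the H x 1 vector of hidden errors.
hidden_errors = np.dot(who.T, output_errors)
print(who.shape, who.T.shape, output_errors.shape, hidden_errors.shape)

# Without the transpose the inner dimensions (H and O) do not line up.
try:
    np.dot(who, output_errors)
    untransposed_ok = True
except ValueError:
    untransposed_ok = False
print("works without transpose:", untransposed_ok)
```

Each entry of hidden_errors is the sum of the output errors weighted by the links leaving one hidden node, which is why the weights have to be indexed hidden-node-first.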

Related

Should I transpose features or weights in Neural network?

I am learning about neural networks.
Here is the complete piece of code:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Exercises).ipynb
When I transpose features, I get the following output:
import torch
def activation(x):
    return 1/(1+torch.exp(-x))
### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable
# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))
product = features.t() * weights + bias
output = activation(product.sum())
tensor(0.9897)
However, if I transpose weights, I get a different output:
weights_prime = weights.view(5,1)
prod = torch.mm(features, weights_prime) + bias
y_hat = activation(prod.sum())
tensor(0.1595)
Why does this happen?
Update
I took a look at the solution:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Solution).ipynb
And I saw this:
y = activation((features * weights).sum() + bias)
Why can a matrix features (1,5) be multiplied by another matrix weights (1,5) without transposing weights first?
Update 2
After reading several posts, I realized that
matrixA * matrixB is different from torch.mm(matrixA,matrixB) and torch.matmul(matrixA,matrixB).
Could someone confirm the following three points of my understanding?
So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication.
differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases.
In the neural network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication.
Update 3
For anyone who has the same confusion, here is the video link: https://www.youtube.com/watch?time_continue=98&v=6Z7WntXays8&feature=emb_logo

Looking at https://pytorch.org/docs/master/generated/torch.nn.Linear.html
The typical linear (fully connected) layer in torch uses input features of shape (N,∗,in_features) and weights of shape (out_features,in_features) to produce an output of shape (N,*,out_features). Here N is the batch size, and * is any number of other dimensions (may be none).
The implementation is:
output = input.matmul(weight.t())
So, the answer is that neither of your formulas is correct according to convention; the standard formula is the one above.
You may use a non-standard shape since you're implementing things from scratch; as long as it's consistent it may work, but I don't recommend it for learning. It's unclear what 1 and 5 are in your code, but presumably you want 5 input features and one output feature, with a batch size of 1 as well. In that case the standard shapes should be input = torch.randn((1, 5)) for batch size=1 and in_features=5, and weights = torch.randn((5, 1)) for in_features=5 and out_features=1.
There is no reason why weights should ever be the same shape as features; thus weights = torch.randn_like(features) doesn't make sense.
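The shape convention described above can be sketched as follows. This uses NumPy as a stand-in for torch (an assumption made so the snippet has no heavy dependency), and the sizes are the ones presumed from the question, not confirmed by it.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, in_features, out_features = 1, 5, 1  # sizes presumed from the question

x = rng.standard_normal((batch_size, in_features))         # (N, in_features)
weight = rng.standard_normal((out_features, in_features))  # layout documented for torch.nn.Linear
bias = rng.standard_normal(out_features)

# (N, in_features) @ (in_features, out_features) -> (N, out_features)
output = x @ weight.T + bias
print(output.shape)
```

Storing the weights directly as (in_features, out_features) and dropping the transpose gives the same numbers; what matters is that the in_features axis of the weights meets the feature axis of the input.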
Lastly, for your actual questions:
"Should I transpose features or weights in Neural network?" - in torch convention, you should transpose weights, but use matmul with the features as the first operand. Other frameworks may have a different convention; as long as the in_features dimension of the weights is multiplied against the num_features dimension of the input, it will work.
"Why does this happen?" - these are two completely different calculations; there is no reason to think they would produce the same result.
"So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication." - Yes; mm is matrix-matrix only, matmul is vector-matrix or matrix-matrix, including batched versions of same - check the docs for everything matmul can do (which is kinda a lot).
"differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases." - Yes; the big difference is that matmul can broadcast. Use it when you specifically intend that; use mm to prevent unintentional broadcasting.
"In the neural network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication." - I doubt it; it's probably an error in the Udacity code. This bit of code weights = torch.randn_like(features) looks like an error in any case; the dimensions of weights have a meaning different from the dimensions of features.

This line is taking an outer product between the two vectors.
product = features.t() * weights + bias
The resulting shape is 5x5.
If you change this to a dot product, then output will match y_hat.
product = torch.mm(weights, features.t()) + bias
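To see the difference between the broadcasted product and the dot product, here is a small sketch. It uses NumPy in place of torch (the broadcasting rules for * are the same), and the feature and weight values are invented for readability.

```python
import numpy as np

features = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])   # shape (1, 5), made-up values
weights = np.array([[0.5, -1.0, 0.0, 2.0, 1.0]])   # shape (1, 5), made-up values

outer = features.T * weights       # (5, 1) * (1, 5) broadcasts to (5, 5)
elementwise = features * weights   # (1, 5) * (1, 5) stays (1, 5)
dot = weights @ features.T         # (1, 5) @ (5, 1) -> (1, 1)

print(outer.shape, elementwise.shape, dot.shape)
print(outer.sum(), elementwise.sum(), dot[0, 0])
```

The element-wise product summed over its entries equals the dot product, which is why the solution notebook's (features * weights).sum() works without a transpose. Note also that adding a (1, 1) bias to the 5x5 product and then summing counts the bias 25 times.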

Use tf.gradients or tf.hessians on flattened parameter tensor

Let's say I want to compute the Hessian of a scalar-valued function with respect to some parameters W (e.g the weights and biases of a feed-forward neural network).
If you consider the following code, implementing a two-dimensional linear model trained to minimize a MSE loss:
import numpy as np
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[None, 2]) #inputs
t = tf.placeholder(dtype=tf.float32, shape=[None,]) #labels
W = tf.Variable(np.eye(2), dtype=tf.float32) #weights
preds = tf.matmul(x, W) #linear model
loss = tf.reduce_mean(tf.square(preds-t), axis=0) #mse loss
params = tf.trainable_variables()
hessian = tf.hessians(loss, params)
you'd expect session.run(hessian, feed_dict={}) to return a 2x2 matrix (equal to W). It turns out that because params is a 2x2 tensor, the output is instead a tensor with shape [2, 2, 2, 2]. While I can easily reshape the tensor to obtain the matrix I want, this operation might become extremely cumbersome when params becomes a list of tensors of varying sizes (i.e. when the model is a deep neural network, for instance).
It seems that there are two ways around this:
Flatten params to be a 1D tensor called flat_params:
flat_params = tf.concat([tf.reshape(p, [-1]) for p in params], axis=0)
so that tf.hessians(loss, flat_params) naturally returns a 2x2 matrix. However, as noted in Why does Tensorflow Reshape tf.reshape() break the flow of gradients? for tf.gradients (and this also holds for tf.hessians), TensorFlow is not able to see the symbolic link in the graph between params and flat_params, and tf.hessians(loss, flat_params) will raise an error because the gradients will be seen as None.
In https://afqueiruga.github.io/tensorflow/2017/12/28/hessian-mnist.html, the author of the code goes the other way, first creating the flat parameter and then reshaping its parts into self.params. This trick does work and gets you the Hessian with its expected shape (2x2 matrix). However, it seems to me that this will be cumbersome to use when you have a complex model, and impossible to apply if you create your model via built-in functions (like tf.layers.dense, ..).
Is there no straightforward way to get the Hessian matrix (as in the 2x2 matrix in this example) from tf.hessians when self.params is a list of tensors of arbitrary shapes? If not, how can you automate the reshaping of the output tensor of tf.hessians?
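For the single-tensor case, the reshaping step itself is mechanical. The sketch below uses plain NumPy and a toy function whose Hessian is known in closed form, so it only illustrates the index bookkeeping; it does not call tf.hessians and says nothing about the multi-tensor case. Note that a 2x2 parameter has four scalar entries, so the flattened Hessian is 4x4.

```python
import numpy as np

# Toy function f(W) = 0.5 * sum(W**2) on a 2x2 parameter W.
# Its Hessian as a 4-D array is H[i, j, k, l] = d^2 f / (dW[i, j] dW[k, l]),
# which is 1 when (i, j) == (k, l) and 0 otherwise.
eye = np.eye(2)
hessian_4d = np.einsum("ik,jl->ijkl", eye, eye)   # shape (2, 2, 2, 2)

# Collapse the first two axes into rows and the last two into columns.
n = int(np.prod(hessian_4d.shape[:2]))            # number of scalar parameters
hessian_2d = hessian_4d.reshape(n, n)             # shape (4, 4)

print(hessian_2d)
```

The same reshape(n, n) applies to any parameter shape s, as long as the 4-D (or higher) array has shape s + s and the parameter is flattened in row-major order.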

It turns out (per TensorFlow r1.13) that if len(xs) > 1, then tf.hessians(ys, xs) returns tensors corresponding to only the block diagonal submatrices of the full Hessian matrix. Full story and solutions in this paper https://arxiv.org/pdf/1905.05559, and code at https://github.com/gknilsen/pyhessian

How to understand the trick in MNIST experiments using tensorflow?

I am a beginner with TensorFlow, and I am a bit confused by the tutorial. The author first gives the formula y=softmax(Wx+b), but uses xW+b in the Python code and explains that this is a small trick. I do not understand the trick: why does the author need to flip the formula?
https://www.tensorflow.org/get_started/mnist/beginners
First, we multiply x by W with the expression tf.matmul(x, W). This is
flipped from when we multiplied them in our equation, where we had Wx,
as a small trick to deal with x being a 2D tensor with multiple
inputs. We then add b, and finally apply tf.nn.softmax.

As you can see from the formula,
y=softmax(Wx + b)
the weight variable W comes first and is multiplied by the input x, but in the doc
y = tf.nn.softmax(tf.matmul(x, W) + b)
x comes first and is multiplied by W for computational convenience, so W must be flipped from 10*784 to 784*10 to stay consistent with the formula.

In general in machine learning, esp. tensorflow, you always want your first dimension to represent your batch. The trick is only a way of ensuring that without transposing everything before and after each matrix multiplication.
x is not really a column vector of features, but a 2D matrix of shape (batch_size, n_features).
If you keep Wx, then you'll transpose x (to x' of shape (n_features, batch_size)) and use W of shape (n_outputs, n_features); Wx' will be of shape (n_outputs, batch_size), so you'll have to transpose it back to (batch_size, n_outputs), which is what you want in the end.
If you're using tf.matmul(x, W), then W is of shape (n_features, n_outputs), and the result is directly of shape (batch_size, n_outputs).
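The two orderings can be compared numerically. This is a sketch in NumPy rather than TensorFlow, with an arbitrary batch size of 4 and the 784/10 sizes from the MNIST tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_features, n_outputs = 4, 784, 10   # batch size is arbitrary

x = rng.standard_normal((batch_size, n_features))   # one sample per row
W = rng.standard_normal((n_outputs, n_features))    # layout used in y = Wx + b
b = rng.standard_normal(n_outputs)

# Textbook order: transpose in, multiply, transpose back.
textbook = (W @ x.T + b[:, None]).T                 # (batch_size, n_outputs)

# Batch-first order: store the weights as W.T and multiply once.
batch_first = x @ W.T + b                           # (batch_size, n_outputs)

print(textbook.shape, batch_first.shape, np.allclose(textbook, batch_first))
```

Both routes give the same logits; the batch-first form simply avoids the two extra transposes and lets the bias broadcast over rows.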

I agree this is not clear at first.
x being a 2D tensor with multiple inputs
is a very succinct way to tell you that in tensorflow, data is stored in tensors following conventions that are not those of linear algebra.
In particular, the outermost dimension (i.e. the first axis, which indexes the rows of a matrix) is always the sample dimension: that is, it has the same size as your number of samples.
When you store sample features in a 2D tensor (a matrix), the features are therefore stored along the innermost dimension, so each row holds the features of one sample. That is, tensor x is the transpose of variable $x$ in the equation. So are W and b. The fact that x.T*W.T=(W.x).T explains the apparent swap in the multiplication between the linear algebra equation and its tensor implementation.

Creating dynamic matrix for every element in a batch in TensorFlow

I am working on a siamese CNN with attention in TensorFlow.
The CNN structure consists of an embedding lookup table shared by two CNNs that share weights.
The inputs for the network are two matrices, both containing indices for question and answer to be fed into the network (batch_size x sentence_length):
self.input_q = tf.placeholder(tf.int32, [None, sentence_length], name="input_q")
self.input_a = tf.placeholder(tf.int32, [None, sentence_length], name="input_a")
After embedding each sentence (row from the input matrix) I end up with two tensors (question and answer), each of size batch_size x sentence_length x embedding_size.
Let's forget for now about the batch dimension to make things easier. That is to say, we have two matrices Qemb and Aemb, both sentence_length x embedding_size.
From these two matrices I would like to construct a third one, an attention matrix A, used later to build a learnable attention feature matrix, which in numpy would be defined as follows:
A[i,j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i,:]-Aemb[j,:]))
This matrix is built for each input pair, so it should be part of the graph, but apparently this cannot be done in TensorFlow as there's no assign operation by index for a Tensor.
Am I right?
I thought I could run the ops for embedding the question and answer, build the A matrix outside the graph given the computed embeddings, and then feed the A matrix back to the graph to continue with the next operations based on it.
self.attention_matrix = tf.placeholder(tf.float32,
                                       [None, sentence_length, sentence_length],
                                       name="Attention_matrix")
Is there any problem with this approach that I might not be aware of?
(Apart from running the embedding ops twice, which doesn't seem optimal, but is not a big deal.)
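The definition of A given above does not actually require assignment by index: the whole matrix can be produced in one expression by broadcasting. The sketch below shows this in NumPy with hypothetical sizes; the same pattern should carry over to TensorFlow ops such as tf.expand_dims, tf.reduce_sum and tf.sqrt, though that translation is an assumption and is not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence_length, embedding_size = 6, 8    # hypothetical sizes

Qemb = rng.standard_normal((sentence_length, embedding_size))
Aemb = rng.standard_normal((sentence_length, embedding_size))

# diff[i, j, :] = Qemb[i, :] - Aemb[j, :], built by broadcasting.
diff = Qemb[:, None, :] - Aemb[None, :, :]
A = 1.0 / (1.0 + np.linalg.norm(diff, axis=-1))   # (sentence_length, sentence_length)

# Reference loop that follows the index-by-index definition.
A_loop = np.empty((sentence_length, sentence_length))
for i in range(sentence_length):
    for j in range(sentence_length):
        A_loop[i, j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i, :] - Aemb[j, :]))

print(A.shape, np.allclose(A, A_loop))
```

With a batch dimension, the same idea applies with one extra leading axis on both embeddings, which would keep the computation inside the graph instead of feeding A back through a placeholder.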

What's the difference between these two LSTM implementation in tensorflow,how to initialize 8 weight matrices of LSTM?

I'm confused about how to define the weight matrices for an LSTM. As there are 8 weight matrices in an LSTM, I don't know how to initialize them in TensorFlow.
I came across this implementation, which makes complete sense, as it has all 8 weight matrices and is consistent with the LSTM equations, but it doesn't use the TensorFlow implementation of LSTM. With the TensorFlow implementation, I don't know how to define all 8 weight matrices the way they are defined in that first implementation.
Could you please help me out?

First things first: if you take a close look, there aren't exactly 8 matrices but 14 in total: 4 * 3 matrices W, U (parameter matrices) and b (bias vector) for the Input Gate, Forget Gate, Cell State and Output State. In addition, there are two more, W and b, for the dense layer.
Now, coming to the actual question, I presume that you would like to know how these matrices are initialized in TensorFlow.
I am answering questions wrt Tensorflow v1.2.
Quick Answer: Use TF-api for LSTM which has a parameter called initializer to be used for initializing weight and projection matrices.
By default, bias vector is initialized to zero vector and kernel
initializers are initialized randomly using uniform distribution.
To see where W and b are used, you need to dig a little deeper into the code. Here are a few pointers.
Method call to evaluate the multiplications: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L576
lstm_matrix = _linear([inputs, m_prev], 4 * self._num_units, bias=True)
Actual computation: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L1051
weights = vs.get_variable(
    _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size],
    dtype=dtype,
    initializer=kernel_initializer)
Here, instead of 4 separate multiplications, a single multiplication is performed and then the resulting matrix is split into 4 parts.
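The fused multiplication can be illustrated with a small NumPy sketch. The sizes and random values are hypothetical, and this is not the TensorFlow code itself; it only shows that one product against a [total_arg_size, 4 * num_units] matrix followed by a split is equivalent to separate per-block products.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, num_units = 2, 3, 4   # hypothetical sizes

inputs = rng.standard_normal((batch_size, input_size))
m_prev = rng.standard_normal((batch_size, num_units))

# One fused weight matrix of shape [total_arg_size, 4 * num_units].
total_arg_size = input_size + num_units
weights = rng.standard_normal((total_arg_size, 4 * num_units))
bias = np.zeros(4 * num_units)

# Single multiplication on the concatenated [inputs, m_prev] ...
lstm_matrix = np.concatenate([inputs, m_prev], axis=1) @ weights + bias

# ... then split column-wise into four (batch_size, num_units) blocks.
parts = np.split(lstm_matrix, 4, axis=1)

# The first block equals a separate W x + U h product built from the matching slices.
W0 = weights[:input_size, :num_units]
U0 = weights[input_size:, :num_units]
separate = inputs @ W0 + m_prev @ U0 + bias[:num_units]

print(len(parts), parts[0].shape, np.allclose(parts[0], separate))
```

In this picture, the rows of the fused matrix that multiply inputs play the role of the four W matrices and the rows that multiply m_prev play the role of the four U matrices, so a single initializer covers all of them.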
So, in a nutshell, TensorFlow does the initialization automatically for you.
If you don't want to go with the default initializers, you have the
flexibility to provide other options.
