Should I transpose features or weights in Neural network?

I am learning about neural networks.
Here is the complete code:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Exercises).ipynb
When I transpose features, I get the following output:
import torch
def activation(x):
    return 1/(1+torch.exp(-x))
### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable
# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))
product = features.t() * weights + bias
output = activation(product.sum())
tensor(0.9897)
However, if I transpose weights, I get a different output:
weights_prime = weights.view(5,1)
prod = torch.mm(features, weights_prime) + bias
y_hat = activation(prod.sum())
tensor(0.1595)
Why does this happen?
Update
I took a look at the solution:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Solution).ipynb
And I saw this:
y = activation((features * weights).sum() + bias)
Why can a matrix features of shape (1, 5) be multiplied by another matrix weights of shape (1, 5) without transposing weights first?
Update 2
After reading several posts, I realized that
matrixA * matrixB is different from torch.mm(matrixA, matrixB) and torch.matmul(matrixA, matrixB).
Could someone confirm my three understandings?
So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication.
differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases.
In Neural Network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication.
Update 3
For anyone who has the same confusion, here is the video link: https://www.youtube.com/watch?time_continue=98&v=6Z7WntXays8&feature=emb_logo

Looking at https://pytorch.org/docs/master/generated/torch.nn.Linear.html
The typical linear (fully connected) layer in torch uses input features of shape (N, *, in_features) and weights of shape (out_features, in_features) to produce an output of shape (N, *, out_features). Here N is the batch size, and * is any number of additional dimensions (possibly none).
The implementation (before the bias is added) is:
output = input.matmul(weight.t())
So, the answer is that neither of your formulas is correct according to convention; the standard formula is the one above.
You may use a non-standard shape since you're implementing things from scratch; as long as it's consistent it may work, but I don't recommend it for learning. It's unclear what 1 and 5 are in your code, but presumably you want 5 input features and one output feature, with a batch size of 1 as well. In that case the standard shapes should be input = torch.randn((1, 5)) for batch size=1 and in_features=5, and weights = torch.randn((5, 1)) for in_features=5 and out_features=1.
There is no reason why weights should ever be the same shape as features; thus weights = torch.randn_like(features) doesn't make sense.
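As a quick check of the convention described above, here is a minimal sketch that compares torch.nn.Linear against the manual formula. The seed and the sizes (5 input features, 1 output feature, batch size 1) are assumptions chosen to mirror the question, not values taken from the Udacity notebook.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

layer = torch.nn.Linear(in_features=5, out_features=1)
x = torch.randn(1, 5)  # batch size 1, 5 input features

# nn.Linear stores its weight as (out_features, in_features).
print(layer.weight.shape)  # torch.Size([1, 5])

# Manual version of the layer: features first, transposed weight second, then bias.
manual = x.matmul(layer.weight.t()) + layer.bias
matches = torch.allclose(manual, layer(x), atol=1e-6)
print(matches)
```

Note that with a single output feature the stored weight happens to have shape (1, 5), the same as one row of features, so a shape match on its own does not tell you which axis is which.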
Lastly, for your actual questions:
"Should I transpose features or weights in Neural network?" - in torch convention, you should transpose weights, but use matmul with the features first. Other frameworks may have a different convention; as long as in_features dimension of the weights is multiplied by the num_features dimension of the input, it would work.
"Why does this happen?" - these are two completely different calculations; there is no reason to think they would produce the same result.
"So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication." - Yes; mm is matrix-matrix only, matmul is vector-matrix or matrix-matrix, including batched versions of same - check the docs for everything matmul can do (which is kinda a lot).
"differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases." - Yes; the big difference is that matmul can broadcast. Use it when you specifically intend that; use mm to prevent unintentional broadcasting.
"In Neutral Network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication." - I doubt it; it's probably an error in the Udacity code. This bit of code weights = torch.randn_like(features) looks like an error in any case; the dimensions of weights have a meaning different from the dimensions of features.

This line is taking an outer product between the two vectors.
product = features.t() * weights + bias
The resulting shape is 5x5.
If you change this to a dot product, then output will match y_hat.
product = torch.mm(weights, features.t()) + bias
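To make the distinctions in this thread concrete, the sketch below reuses the shapes and seed from the question and compares the broadcasted outer product with three ways of writing the dot product, then contrasts mm and matmul on a batched input. The batch tensor is a made-up example and is not part of the original exercise.

```python
import torch

torch.manual_seed(7)
features = torch.randn((1, 5))
weights = torch.randn_like(features)
bias = torch.randn((1, 1))

# (5, 1) * (1, 5) broadcasts to a (5, 5) outer product; adding the (1, 1)
# bias afterwards would add a copy of the bias to each of the 25 entries.
outer = features.t() * weights
print(outer.shape)  # torch.Size([5, 5])

# Three equivalent ways to compute the dot product of features and weights.
dot_elementwise = (features * weights).sum()
dot_mm = torch.mm(features, weights.view(5, 1))
dot_mm_alt = torch.mm(weights, features.t())
same_1 = torch.allclose(dot_elementwise, dot_mm.squeeze(), atol=1e-6)
same_2 = torch.allclose(dot_mm, dot_mm_alt, atol=1e-6)
print(same_1, same_2)

# matmul broadcasts over a leading batch dimension; mm accepts only 2-D inputs.
batch = torch.randn(10, 1, 5)  # hypothetical batch of 10 feature rows
batched = torch.matmul(batch, weights.view(5, 1))
print(batched.shape)  # torch.Size([10, 1, 1])
try:
    torch.mm(batch, weights.view(5, 1))
    mm_accepts_3d = True
except RuntimeError:
    mm_accepts_3d = False
print(mm_accepts_3d)
```

Seen this way, the element-wise product followed by sum() is just another way of writing the dot product, which is why it agrees with the torch.mm versions while the outer product does not.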

Related

Batch inference of softmax does not sum to 1

I am working with the REINFORCE algorithm in PyTorch. I noticed that the batch predictions of my simple network with Softmax don't sum to 1 (not even close to 1). I am attaching a minimal working example so that you can reproduce it. What am I missing here?
import numpy as np
import torch
obs_size = 9
HIDDEN_SIZE = 9
n_actions = 2
np.random.seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(obs_size, HIDDEN_SIZE),
    torch.nn.ReLU(),
    torch.nn.Linear(HIDDEN_SIZE, n_actions),
    torch.nn.Softmax(dim=0)
)
state_transitions = np.random.rand(3, obs_size)
state_batch = torch.Tensor(state_transitions)
pred_batch = model(state_batch) # WRONG PREDICTIONS!
print('wrong predictions:\n', *pred_batch.detach().numpy())
# [0.34072137 0.34721774] [0.30972624 0.30191955] [0.3495524 0.3508627]
# DOES NOT SUM TO 1 !!!
pred_batch = [model(s).detach().numpy() for s in state_batch] # CORRECT PREDICTIONS
print('correct predictions:\n', *pred_batch)
# [0.5955179 0.40448207] [0.6574412 0.34255883] [0.624833 0.37516695]
# DOES SUM TO 1 AS EXPECTED

Although PyTorch lets us get away with it, we don’t actually provide an input with the right dimensionality. We have a model that takes one input and produces one output, but PyTorch nn.Module and its subclasses are designed to do so on multiple samples at the same time. To accommodate multiple samples, modules expect the zeroth dimension of the input to be the number of samples in the batch.
Deep Learning with PyTorch
That your model works on each individual sample is an implementation nicety. You have incorrectly specified the dimension for the softmax (across batches instead of across the variables), and hence when given a batch dimension it is computing the softmax across samples instead of within samples:
nn.Softmax requires us to specify the dimension along which the softmax function is applied:
softmax = nn.Softmax(dim=1)
In this case, we have two input vectors in two rows (just like when we work with batches), so we initialize nn.Softmax to operate along dimension 1.
Change torch.nn.Softmax(dim=0) to torch.nn.Softmax(dim=1) to get appropriate results.
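The effect of the dim argument can be seen in isolation with a small sketch. The logits below are random stand-ins for the network output (3 samples, 2 actions), not the values from the question.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(3, 2)  # hypothetical batch: 3 samples, 2 actions each

over_batch = torch.nn.Softmax(dim=0)(logits)    # normalizes each column across samples
over_actions = torch.nn.Softmax(dim=1)(logits)  # normalizes each row across actions

print(over_batch.sum(dim=0))    # each column sums to 1, rows generally do not
print(over_actions.sum(dim=1))  # each row sums to 1, as wanted for action probabilities

# A single 1-D sample has only one dimension, so dim=0 is the action axis there.
# That is why the per-sample loop in the question produced valid probabilities.
single = torch.nn.Softmax(dim=0)(logits[0])
print(torch.allclose(single, over_actions[0]))
```

If the same model has to handle both single samples and batches, dim=-1 (the last dimension) is one option that covers both cases.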

Why transpose a matrix here? (Neural network related)

While reading some code to understand neural networks, I wondered about this line:
hidden_errors = numpy.dot(self.who.T, output_errors)
# The hidden layer error is calculated by recombining the output layer errors, split according to the weights.
Why is the matrix transposed here?
To avoid any copyright issues, I am linking to the original code on GitHub instead of posting all of it.
https://github.com/freebz/Make-Your-Own-Neural-Network/blob/master/neural_network.py

The transpose is necessary in order for the dot product (in this case equivalent to matrix multiplication) to be possible and calculate the right thing.
self.who is created in the line:
numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))
which makes it an OxH matrix, where O (rows) is the number of output nodes and H (columns) is the number of hidden nodes. targets is created in the line:
targets = numpy.array(targets_list, ndmin=2).T
which makes it an Ox1 (2D) matrix. output_errors is then the same dimensions as targets and also an Ox1 column vector of O values (as would be expected):
output_errors = targets - final_outputs
In the implemented backpropagation algorithm, the hidden errors are calculated by premultiplying the output errors by the weights connecting the hidden layer to the output layer. From the numpy doc of numpy.dot:
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
So we need to transpose self.who to be an HxO matrix for this to work correctly in multiplying with an Ox1 matrix to give the required Hx1 matrix of hidden errors.
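The shape argument above can be checked with a small NumPy sketch. The layer sizes (4 hidden nodes, 3 output nodes) and the random values are hypothetical; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
hnodes, onodes = 4, 3  # hypothetical layer sizes

who = rng.normal(0.0, onodes ** -0.5, (onodes, hnodes))  # O x H weight matrix
output_errors = rng.normal(size=(onodes, 1))             # O x 1 column vector

# (H x O) times (O x 1) gives the H x 1 vector of hidden errors.
hidden_errors = np.dot(who.T, output_errors)
print(hidden_errors.shape)  # (4, 1)

# Without the transpose the inner dimensions (H and O) do not line up.
try:
    np.dot(who, output_errors)
    untransposed_ok = True
except ValueError as err:
    untransposed_ok = False
    print("without transpose:", err)
```

If H and O happened to be equal, the untransposed product would run without an error but would combine the wrong weights, so the transpose is needed for correctness and not only for the shapes to match.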

Tensorflow compute_weighted_loss Example

I want to use TensorFlow's tf.losses.compute_weighted_loss but cannot find a good example. I have a multi-class classification problem and use tf.nn.sigmoid_cross_entropy_with_logits as the loss. Now I want to weight the errors for each label independently. Let's say I have n labels, which means I need an n-sized weight vector. Unfortunately, TensorFlow expects me to pass a (b, n)-shaped matrix of error weights, where b is the batch size. So basically I would need to repeat the weight vector b times. That's fine with a fixed batch size, but if my batch size varies (e.g. a smaller batch at the end of the dataset) I have to adapt. Is there a way around this, or did I miss something?

I just had to reshape the vector from (n,) to (1,n) to make the broadcasting possible:
error_weights = error_weights.reshape(1, error_weights.shape[0])

Adding to the existing answer, use tf.expand_dims if error_weights is a Tensor.
error_weights = tf.expand_dims(error_weights, 0) # changes shape [n] to [1, n]
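Why the (1, n) shape is enough can be illustrated with plain NumPy broadcasting. This sketch only mimics the element-wise weighting step; it does not call tf.losses.compute_weighted_loss, and the weight values and batch sizes are made up.

```python
import numpy as np

n = 4
error_weights = np.array([1.0, 2.0, 0.5, 1.0])                    # shape (n,)
error_weights = error_weights.reshape(1, error_weights.shape[0])  # shape (1, n)

shapes = []
for b in (8, 3):  # e.g. a full batch and a smaller final batch
    per_label_loss = np.ones((b, n))           # stand-in for the per-label losses
    weighted = per_label_loss * error_weights  # (b, n) * (1, n) broadcasts to (b, n)
    shapes.append(weighted.shape)
    # Every row received the same per-label weights.
    assert np.allclose(weighted, np.tile(error_weights, (b, 1)))

print(shapes)  # [(8, 4), (3, 4)]
```

The leading axis of size 1 is stretched to whatever batch size arrives, so the weight vector never has to be repeated by hand.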

Creating dynamic matrix for every element in a batch in TensorFlow

I am working on a siamese CNN with attention in TensorFlow.
The CNN structure consists of an embedding lookup table shared by two CNNs with shared weights.
The inputs for the network are two matrices, both containing indices for question and answer to be fed into the network (batch_size x sentence_length):
self.input_q = tf.placeholder(tf.int32, [None, sentence_length], name="input_q")
self.input_a = tf.placeholder(tf.int32, [None, sentence_length], name="input_a")
After embedding each sentence (a row of the input matrix) I end up with two tensors (question and answer), each of size batch_size x sentence_length x embedding_size.
Let's forget about the batch dimension for now to make things easier. That is, we have two matrices Qemb and Aemb, both sentence_length x embedding_size.
From these two matrices I would like to construct a third one, an attention matrix A, used later for a learnable attention feature matrix. Using numpy, it would be defined as follows:
A[i,j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i,:]-Aemb[j,:]))
This matrix is built for each input pair, so it should be part of the graph, but apparently this cannot be done in TensorFlow, as there is no assign-by-index operation for a Tensor.
Am I right?
I thought I could run the ops for embedding the question and answer, build the A matrix outside the graph from the computed embeddings, and then feed the A matrix back into the graph to continue with the next operations based on it.
self.attention_matrix = \
    tf.placeholder(tf.float32,
                   [None, sentence_length, sentence_length],
                   name="Attention_matrix")
Is there any problem with this approach that I might not be aware of?
(Apart from running the embedding ops twice, which doesn't seem optimal, but is not a big deal.)
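For reference, the definition of A above does not strictly need assignment by index: every entry can be computed at once with broadcasting. The following is a NumPy sketch with made-up sizes and random embeddings. The same expand, subtract and reduce pattern should carry over to tensor ops inside the graph, but that is an assumption and is not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence_length, embedding_size = 5, 8  # hypothetical sizes
Qemb = rng.normal(size=(sentence_length, embedding_size))
Aemb = rng.normal(size=(sentence_length, embedding_size))

# Pairwise differences via broadcasting: (L, 1, E) - (1, L, E) gives (L, L, E).
diff = Qemb[:, None, :] - Aemb[None, :, :]
A = 1.0 / (1.0 + np.linalg.norm(diff, axis=-1))  # shape (L, L)

# Cross-check against the explicit per-index definition from the question.
A_loop = np.empty((sentence_length, sentence_length))
for i in range(sentence_length):
    for j in range(sentence_length):
        A_loop[i, j] = 1.0 / (1.0 + np.linalg.norm(Qemb[i, :] - Aemb[j, :]))

print(A.shape)                 # (5, 5)
print(np.allclose(A, A_loop))  # True
```

With a batch dimension in front, the same indexing pattern would add one more leading axis, at the cost of an intermediate tensor of size batch x L x L x E.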

What's the difference between these two LSTM implementations in TensorFlow, and how do I initialize the 8 weight matrices of an LSTM?

I'm confused about how to define the weight matrices for an LSTM. As there are 8 weight matrices in an LSTM, I don't know how to initialize them for an LSTM in TensorFlow.
But then I came across this implementation, which makes complete sense, as it has all 8 weight matrices, but it doesn't use the TensorFlow implementation of LSTM. It is consistent with the LSTM equations. In the TensorFlow implementation of LSTM, however, I don't know how to define all 8 weight matrices the way they are defined in that first implementation.
Could you please help me out?

First things first: if you take a close look, there aren't exactly 8 matrices but 14 in total: 4 * 3 matrices W, U (parameter matrices) and b (bias vector) for the input gate, forget gate, cell state and output gate. In addition, there are two more, W and b, for the dense layer.
Now, coming to the actual question, I presume that you would like to know how these matrices are initialized in TensorFlow.
I am answering with respect to TensorFlow v1.2.
Quick answer: use the TensorFlow LSTM API, which has a parameter called initializer for initializing the weight and projection matrices.
By default, the bias vector is initialized to zeros and the kernels are initialized randomly from a uniform distribution.
To see where W and b are used, you need to dig a little deeper into the code. Here are a few pointers.
Method call to evaluate the multiplications: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L576
lstm_matrix = _linear([inputs, m_prev], 4 * self._num_units, bias=True)
Actual computation: https://github.com/tensorflow/tensorflow/blob/3686ef0d51047d2806df3e2ff6c1aac727456c1d/tensorflow/python/ops/rnn_cell_impl.py#L1051
weights = vs.get_variable(
    _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size],
    dtype=dtype,
    initializer=kernel_initializer)
Here, instead of 4 separate multiplications, a single multiplication is performed and the result is then split into 4 parts.
So, in a nutshell, TensorFlow does the initialization automatically for you.
If you don't want the default initializers, you have the flexibility to provide others.
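The single fused multiplication can be illustrated in NumPy. This is a sketch of the idea only, not the TensorFlow source: the sizes, the uniform range, and the gate names i, j, f, o used for the four slices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_size, num_units = 2, 3, 4  # hypothetical sizes

inputs = rng.normal(size=(batch, input_size))
m_prev = rng.normal(size=(batch, num_units))

# One fused kernel of shape [input_size + num_units, 4 * num_units] and one fused bias.
kernel = rng.uniform(-0.1, 0.1, size=(input_size + num_units, 4 * num_units))
bias = np.zeros(4 * num_units)

# A single matrix multiplication on the concatenated [inputs, m_prev] ...
lstm_matrix = np.concatenate([inputs, m_prev], axis=1) @ kernel + bias
# ... followed by a split into four equal parts, one per gate/candidate.
i, j, f, o = np.split(lstm_matrix, 4, axis=1)

# The per-gate W (input) and U (recurrent) matrices are blocks of the fused kernel.
W_i = kernel[:input_size, :num_units]
U_i = kernel[input_size:, :num_units]
print(np.allclose(i, inputs @ W_i + m_prev @ U_i))  # True
```

Seen this way, the 8 matrices from the LSTM equations are all present, stored as blocks of one variable, which is why a single kernel initializer covers all of them.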
