How to understand the trick in MNIST experiments using tensorflow? - python

I am a beginner with TensorFlow, and I am a bit confused by the tutorial. The author first gives the formula y = softmax(Wx + b), but uses xW + b in the Python code and explains that it is a small trick. I do not understand the trick: why does the author need to flip the formula?
https://www.tensorflow.org/get_started/mnist/beginners
First, we multiply x by W with the expression tf.matmul(x, W). This is flipped from when we multiplied them in our equation, where we had Wx, as a small trick to deal with x being a 2D tensor with multiple inputs. We then add b, and finally apply tf.nn.softmax.

As you can see from the formula,
y=softmax(Wx + b)
the input x is multiplied by the weight variable W, but in the doc
y = tf.nn.softmax(tf.matmul(x, W) + b)
x is multiplied by W for computational convenience, so we must flip W from 10*784 to 784*10 to keep it consistent with the formula.

In general in machine learning, and especially in TensorFlow, you always want your first dimension to represent the batch. The trick is just a way of ensuring that without transposing everything before and after each matrix multiplication.
x is not really a column vector of features, but a 2D matrix of shape (batch_size, n_features).
If you keep Wx, then you'll transpose x (to x' of shape (n_features, batch_size)) and use W of shape (n_outputs, n_features); Wx' will be of shape (n_outputs, batch_size), so you'll have to transpose it back to (batch_size, n_outputs), which is what you want in the end.
If you use tf.matmul(x, W), then W is of shape (n_features, n_outputs) and the result is directly of shape (batch_size, n_outputs).
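A minimal NumPy sketch of the shape bookkeeping (batch_size=3, n_features=784, n_outputs=10 are chosen just for illustration):

import numpy as np

batch_size, n_features, n_outputs = 3, 784, 10
x = np.random.randn(batch_size, n_features)   # samples in rows
W = np.random.randn(n_features, n_outputs)    # stored as (n_features, n_outputs)
b = np.random.randn(n_outputs)

# "Flipped" form used in the tutorial: one multiplication, no transposes
y1 = x @ W + b                                # shape (batch_size, n_outputs)

# Literal Wx form from the equation: transpose in, transpose out
W_eq = W.T                                    # (n_outputs, n_features), as in y = softmax(Wx + b)
y2 = (W_eq @ x.T).T + b                       # same numbers, two extra transposes

assert np.allclose(y1, y2)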

I agree this is not clear at first.
x being a 2D tensor with multiple inputs
is a very succinct way of saying that in TensorFlow, data is stored in tensors following conventions that are not those of linear algebra.
In particular, the outermost dimension (i.e. rows, for matrices) is always the sample dimension: it has the same size as your number of samples.
When you store sample features in a 2D tensor (a matrix), the features are therefore stored in the innermost dimension, i.e. columns. That is, the tensor x is the transpose of the variable x in the equation, and so are W and b. The identity x.T * W.T = (W.x).T explains the apparent swap in the multiplication between the linear algebra equation and the tensor implementation of it.
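A one-line NumPy check of that identity (shapes are illustrative; @ denotes the matrix product written as * above):

import numpy as np

W = np.random.randn(10, 784)              # equation convention: (n_outputs, n_features)
x = np.random.randn(784, 3)               # equation convention: samples in columns
assert np.allclose((W @ x).T, x.T @ W.T)  # (W.x).T == x.T * W.T, hence the flipped matmul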

Related

Should I transpose features or weights in Neural network?

I am learning about neural networks.
Here is the complete piece of code:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Exercises).ipynb
When I transpose features, I get the following output:
import torch

def activation(x):
    return 1 / (1 + torch.exp(-x))

### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable
# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))

product = features.t() * weights + bias
output = activation(product.sum())

tensor(0.9897)
However, if I transpose weights, I get a different output:
weights_prime = weights.view(5,1)
prod = torch.mm(features, weights_prime) + bias
y_hat = activation(prod.sum())
tensor(0.1595)
Why does this happen?
Update
I took a look at the solution:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Solution).ipynb
And I saw this:
y = activation((features * weights).sum() + bias)
why can a matrix features(1,5) multiply another matrix weights(1,5) without transposing weights first?
Update 2
After reading several posts, I realized that
matrixA * matrixB is different from torch.mm(matrixA, matrixB) and torch.matmul(matrixA, matrixB).
Could someone confirm my three understandings below?
So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication.
differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases.
In the neural network for the Udacity coding exercise mentioned in my link above, it needs element-wise multiplication.
Update 3
Just to bring in the video screenshot for anyone who has the same confusion:
And here is the video link: https://www.youtube.com/watch?time_continue=98&v=6Z7WntXays8&feature=emb_logo
Looking at https://pytorch.org/docs/master/generated/torch.nn.Linear.html
The typical linear (fully connected) layer in torch uses input features of shape (N, *, in_features) and weights of shape (out_features, in_features) to produce an output of shape (N, *, out_features). Here N is the batch size, and * is any number of other dimensions (may be none).
The implementation is:
output = input.matmul(weight.t())
So, the answer is that neither of your formulas is correct according to convention; the standard formula is the one above.
You may use a non-standard shape since you're implementing things from scratch; as long as it's consistent it may work, but I don't recommend it for learning. It's unclear what 1 and 5 are in your code, but presumably you want 5 input features and one output feature, with a batch size of 1 as well. In that case the standard shapes would be input = torch.randn((1, 5)) for batch_size=1 and in_features=5, and weights = torch.randn((5, 1)) for in_features=5 and out_features=1.
There is no reason why weights should ever be the same shape as features; thus weights = torch.randn_like(features) doesn't make sense.
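A short PyTorch sketch of the convention above (sizes chosen for illustration: batch of 2, 5 input features, 3 outputs, so the weights clearly cannot share the shape of the features):

import torch

torch.manual_seed(7)
x = torch.randn(2, 5)        # (N, in_features)
weight = torch.randn(3, 5)   # (out_features, in_features), as nn.Linear stores it
bias = torch.randn(3)        # (out_features,)

out = x.matmul(weight.t()) + bias   # (N, out_features)
print(out.shape)                    # torch.Size([2, 3])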
Lastly, for your actual questions:
"Should I transpose features or weights in Neural network?" - in torch convention, you should transpose weights, but use matmul with the features first. Other frameworks may have a different convention; as long as in_features dimension of the weights is multiplied by the num_features dimension of the input, it would work.
"Why does this happen?" - these are two completely different calculations; there is no reason to think they would produce the same result.
"So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication." - Yes; mm is matrix-matrix only, matmul is vector-matrix or matrix-matrix, including batched versions of same - check the docs for everything matmul can do (which is kinda a lot).
"differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases." - Yes; the big difference is that matmul can broadcast. Use it when you specifically intend that; use mm to prevent unintentional broadcasting.
"In Neutral Network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication." - I doubt it; it's probably an error in the Udacity code. This bit of code weights = torch.randn_like(features) looks like an error in any case; the dimensions of weights have a meaning different from the dimensions of features.
This line is taking an outer product between the two vectors.
product = features.t() * weights + bias
The resulting shape is 5x5.
If you change this to a dot product, then output will match y_hat.
product = torch.mm(weights, features.t()) + bias
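To make the contrast explicit, here is a minimal sketch of the three operations on the shapes from the question:

import torch

torch.manual_seed(7)
features = torch.randn(1, 5)
weights = torch.randn_like(features)
bias = torch.randn(1, 1)

elementwise = features * weights               # (1, 5): Hadamard product of equal shapes
outer = features.t() * weights                 # (5, 5): broadcasting makes this an outer product
dot = torch.mm(features, weights.t()) + bias   # (1, 1): true matrix product

# summing the element-wise product recovers the matrix product, so the
# Udacity solution and the torch.mm version agree
assert torch.allclose(elementwise.sum() + bias, dot)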

Use tf.gradients or tf.hessians on flattened parameter tensor

Let's say I want to compute the Hessian of a scalar-valued function with respect to some parameters W (e.g. the weights and biases of a feed-forward neural network).
If you consider the following code, implementing a two-dimensional linear model trained to minimize an MSE loss:
import numpy as np
import tensorflow as tf
x = tf.placeholder(dtype=tf.float32, shape=[None, 2]) #inputs
t = tf.placeholder(dtype=tf.float32, shape=[None,]) #labels
W = tf.Variable(np.eye(2), dtype=tf.float32) #weights
preds = tf.matmul(x, W) #linear model
loss = tf.reduce_mean(tf.square(preds-t), axis=0) #mse loss
params = tf.trainable_variables()
hessian = tf.hessians(loss, params)
you'd expect session.run(hessian, feed_dict={}) to return a 2x2 matrix (equal to W). It turns out that because params is a 2x2 tensor, the output is rather a tensor with shape [2, 2, 2, 2]. While I can easily reshape the tensor to obtain the matrix I want, it seems that this operation might be extremely cumbersome when params becomes a list of tensors of varying sizes (i.e. when the model is a deep neural network, for instance).
It seems that there are two ways around this:
Flatten params to be a 1D tensor called flat_params:
flat_params = tf.concat([tf.reshape(p, [-1]) for p in params], axis=0)
so that tf.hessians(loss, flat_params) naturally returns a 2x2 matrix. However, as noted in Why does Tensorflow Reshape tf.reshape() break the flow of gradients? for tf.gradients (it also holds for tf.hessians), tensorflow is not able to see the symbolic link in the graph between params and flat_params, and tf.hessians(loss, flat_params) will raise an error as the gradients will be seen as None.
In https://afqueiruga.github.io/tensorflow/2017/12/28/hessian-mnist.html, the author of the code goes the other way: he first creates the flat parameter and reshapes its parts into self.params. This trick does work and gets you the Hessian with its expected shape (a 2x2 matrix). However, it seems to me that this will be cumbersome to use when you have a complex model, and impossible to apply if you create your model via built-in functions (like tf.layers.dense, ...).
Is there no straightforward way to get the Hessian matrix (as in the 2x2 matrix in this example) from tf.hessians when self.params is a list of tensors of arbitrary shapes? If not, how can you automate the reshaping of the output tensor of tf.hessians?
It turns out (as of TensorFlow r1.13) that if len(xs) > 1, then tf.hessians(ys, xs) returns tensors corresponding only to the block-diagonal submatrices of the full Hessian matrix. The full story and solutions are in this paper https://arxiv.org/pdf/1905.05559, and the code is at https://github.com/gknilsen/pyhessian
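For the single-tensor case in the question, the [2, 2, 2, 2] output can at least be folded back into a matrix with one reshape. A TF1-style sketch (restating the model with a scalar loss, since tf.hessians differentiates a scalar; shapes follow the question):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
t = tf.placeholder(tf.float32, [None, 2])
W = tf.Variable(np.eye(2), dtype=tf.float32)
loss = tf.reduce_mean(tf.square(tf.matmul(x, W) - t))  # scalar MSE loss

hess = tf.hessians(loss, [W])[0]        # shape [2, 2, 2, 2]
n = int(np.prod(W.shape.as_list()))     # 4 entries in W
hess_matrix = tf.reshape(hess, [n, n])  # shape [4, 4]: Hessian over the flattened W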

Tensorflow compute_weighted_loss Example

I want to use TensorFlow's tf.losses.compute_weighted_loss but cannot find any good example. I have a multi-class classification problem and use tf.nn.sigmoid_cross_entropy_with_logits as the loss. Now I want to weight the errors for each label independently. Let's say I have n labels; that means I need an n-sized weight vector. Unfortunately tf expects me to pass a (b, n)-shaped matrix of error weights, where b is the batch size. So basically I would need to repeat the weight vector b times. That's okay given a fixed batch size, but if my batch size is variable (e.g. a smaller batch at the end of the dataset) I have to adapt. Is there a way around this, or did I miss something?
I just had to reshape the vector from (n,) to (1,n) to make the broadcasting possible:
error_weights = error_weights.reshape(1, error_weights.shape[0])
Adding to the existing answer, use tf.expand_dims if error_weights is a Tensor.
error_weights = tf.expand_dims(error_weights, 0) # changes shape [n] to [1, n]
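A minimal TF1-style sketch of the broadcast (names and sizes are illustrative):

import tensorflow as tf

n = 3                                             # number of labels
logits = tf.placeholder(tf.float32, [None, n])
labels = tf.placeholder(tf.float32, [None, n])

losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)  # shape (b, n)
error_weights = tf.constant([1.0, 2.0, 0.5])      # one weight per label, shape (n,)
error_weights = tf.expand_dims(error_weights, 0)  # (1, n): broadcasts against any batch size
loss = tf.losses.compute_weighted_loss(losses, weights=error_weights)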

Tensor Flow tutorial logloss implementation

I need to study TF quickly and I can't understand this part:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
It's explained with this: First, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ with the corresponding element of tf.log(y). Then tf.reduce_sum adds the elements in the second dimension of y, due to the reduction_indices=[1] parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch.
Why does it do these manipulations? Why do we need another dimension? Thanks
There are two dimensions because cross_entropy computes values for a batch of training examples. Therefore, the dimension 0 is for a batch, and dimension 1 is for different classes of a specific example. For example, if there are 3 possible classes and batch size is 2, then y is a 2D tensor of size (2, 3).
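A concrete sketch with those sizes (3 classes, batch size 2; the numbers are made up):

import numpy as np

y  = np.array([[0.7, 0.2, 0.1],   # predicted probabilities, one row per example
               [0.1, 0.1, 0.8]])
y_ = np.array([[1.0, 0.0, 0.0],   # one-hot true labels
               [0.0, 0.0, 1.0]])

per_example = -np.sum(y_ * np.log(y), axis=1)  # reduce over dimension 1 (the classes)
cross_entropy = np.mean(per_example)           # mean over dimension 0 (the batch)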

TensorFlow Multi-Layer Perceptron

I am learning TensorFlow, and my goal is to implement a multilayer perceptron for my needs. I checked the MNIST tutorial with a multilayer perceptron implementation and everything was clear to me except this:
_, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
I guess x is an image itself (28*28 pixels, so the input is 784 neurons) and y is a label, which is a 1x10 array:
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
They feed whole batches (which are packs of data points and labels)! How does tensorflow interpret this "batch" input? And how does it update the weights: simultaneously after each element in a batch, or after running through the whole batch?
And, if I need to input one number (input_shape = [1,1]) and output four numbers (output_shape = [1,4]), how should I change the tf.placeholders and in which form should I feed them into session?
When I ask, how does tensorflow interpret it, I want to know how tensorflow splits the batch into single elements. For example, batch is a 2-D array, right? In which direction does it split an array? Or it uses matrix operations and doesn't split anything?
When I ask, how should I feed my data, I want to know, should it be a 2-D array with samples at its rows and features at its columns, or, maybe, could it be a 2-D list.
When I feed my float numpy array X_train to x, which is:
x = tf.placeholder("float", [1, n_input])
I receive an error:
ValueError: Cannot feed value of shape (1, 18) for Tensor 'Placeholder_10:0', which has shape '(1, 1)'
It appears that I have to create my data as a Tensor too?
When I tried [18x1]:
Cannot feed value of shape (18, 1) for Tensor 'Placeholder_12:0', which has shape '(1, 1)'
They feed whole batches (which are packs of data points and labels)!
Yes, this is how neural networks are usually trained (due to some nice mathematical properties: you get the best of both worlds - a better gradient approximation than in SGD on one hand, and much faster convergence than full GD on the other).
How does tensorflow interpret this "batch" input?
It "interprets" it according to operations in your graph. You probably have reduce mean somewhere in your graph, which calculates average over your batch, thus causing this to be the "interpretation".
And how does it update the weights: 1. simultaneously after each element in a batch? 2. After running through the whole batch?
As in the previous answer - there is nothing "magical" about a batch, it is just another dimension, and each internal operation of the neural net is well defined for a batch of data, thus there is still a single update in the end. Since you use a reduce mean operation (or maybe reduce sum?) you are updating according to the mean of the "small" gradients (or the sum, if there is reduce sum instead). Again - you could control it (up to the aggregation behaviour; you cannot force it to do per-sample updates unless you introduce a while loop into the graph).
And, if I need to input one number (input_shape = [1,1]) and output four numbers (output_shape = [1,4]), how should I change the tf.placeholders, and in which form should I feed them into the session? THANKS!!
Just set the variables n_input=1 and n_classes=4, and push your data as before, as [batch, n_input] and [batch, n_classes] arrays (in your case batch=1, if by "1x1" you mean "one sample of dimension 1"; your edit suggests that you actually do have a batch, and by 1x1 you meant a 1-d input).
EDIT: 1. When I ask how tensorflow interprets it, I want to know how tensorflow splits the batch into single elements. For example, a batch is a 2-D array, right? In which direction does it split the array? Or does it use matrix operations and not split anything? 2. When I ask how I should feed my data, I want to know whether it should be a 2-D array with samples in its rows and features in its columns, or whether it could be a 2-D list.
It does not split anything. It is just a matrix, and each operation is perfectly well defined for matrices as well. Usually you put examples in rows, thus in the first dimension, and this is exactly what [batch, n_inputs] says - that you have batch rows, each with n_inputs columns. But again - there is nothing special about it, and you could also create a graph which accepts column-wise batches if you really needed to.
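A minimal TF1-style sketch of that feeding convention (n_input=1, n_classes=4, and a batch of 3 samples; the names follow the question):

import numpy as np
import tensorflow as tf

n_input, n_classes = 1, 4
x = tf.placeholder("float", [None, n_input])   # None lets the batch size vary
y = tf.placeholder("float", [None, n_classes])

X_batch = np.random.randn(3, n_input)          # samples in rows, features in columns
Y_batch = np.random.randn(3, n_classes)

with tf.Session() as sess:
    # shapes come out as [3 1] and [3 4]
    print(sess.run([tf.shape(x), tf.shape(y)], feed_dict={x: X_batch, y: Y_batch}))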
