I'm trying to test the layer normalization function of PyTorch, but I don't know why b[0] and result have different values here. Did I do something wrong?
import numpy as np
import torch
import torch.nn as nn
a = torch.randn(1, 5)
m = nn.LayerNorm(a.size()[1:], elementwise_affine=False)
b = m(a)
Result:
input: a[0] = tensor([-1.3549, 0.3857, 0.1110, -0.8456, 0.1486])
output: b[0] = tensor([-1.5561, 1.0386, 0.6291, -0.7967, 0.6851])
mean = torch.mean(a[0])
var = torch.var(a[0])
result = (a[0]-mean)/(torch.sqrt(var+1e-5))
Result:
result = tensor([-1.3918, 0.9289, 0.5627, -0.7126, 0.6128])
Also, for an n×2 input, the result of PyTorch's layer norm is always [1.0, -1.0] (or [-1.0, 1.0]). I can't understand why. Please let me know if you have any hints.
a = torch.randn(1, 2)
m = nn.LayerNorm(a.size()[1:], elementwise_affine=False)
b = m(a)
Result:
b = tensor([-1.0000, 1.0000])
For calculating the variance, use torch.var(a[0], unbiased=False). Then you will get the same result. By default, PyTorch calculates the unbiased estimate of the variance.
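A minimal check, re-running the snippet from the question with unbiased=False (the seed is arbitrary, so the actual numbers will differ from the ones above):
import torch
import torch.nn as nn
torch.manual_seed(0)  # any seed; only set so the check is repeatable
a = torch.randn(1, 5)
m = nn.LayerNorm(a.size()[1:], elementwise_affine=False)
b = m(a)
mean = torch.mean(a[0])
var = torch.var(a[0], unbiased=False)            # biased (population) variance, as LayerNorm uses
result = (a[0] - mean) / torch.sqrt(var + 1e-5)  # 1e-5 is LayerNorm's default eps
print(torch.allclose(b[0], result, atol=1e-5))   # prints True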
For your 1st question, as Theodor said, you need to use unbiased=False when calculating the variance.
Only if you want to explore more: as your input size is 5, the unbiased estimate of the variance will be 5/4 = 1.25 times the biased estimate, because the unbiased estimate uses N-1 instead of N in the denominator. As a result, each value of the result you generated is sqrt(4/5) ≈ 0.8944 times the corresponding value of b[0].
About your 2nd question:
Also, for an n×2 input, the result of PyTorch's layer norm is always [1.0, -1.0] (or [-1.0, 1.0]).
This is reasonable. Suppose the only two elements are a and b. Then the mean is (a+b)/2 and the variance is ((a-b)^2)/4, so the normalized result is [((a-b)/2) / sqrt(variance), ((b-a)/2) / sqrt(variance)], which is essentially [1, -1] or [-1, 1] depending on whether a > b or a < b.
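A quick way to convince yourself of this (a small sketch; the values are only approximately ±1 because of the eps that LayerNorm adds to the variance):
import torch
import torch.nn as nn
m = nn.LayerNorm(2, elementwise_affine=False)
for _ in range(3):
    a = torch.randn(1, 2)
    print(m(a))  # always (approximately) [1., -1.] or [-1., 1.]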
How do I do this in torch:
np.random.normal(loc=mean, scale=stdev, size=vsize)
i.e. draw samples by providing the mean and standard deviation (or variance)?
There's a similar torch function torch.normal:
torch.normal(mean=mean, std=std)
If mean and std are scalars, this will produce a scalar value. You can simply repeat them to the target shape you want, e.g.:
mean = 2
std = 10
size = (3,3)
r = torch.normal(mean=torch.full(size, mean).float(), std=torch.full(size, std).float())
print(r)
> tensor([[ 2.2263, 1.1374, 4.5766],
[ 3.4727, 2.6712, 2.4878],
[-0.1787, 2.9600, 2.7598]])
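If your PyTorch version supports it, there is also an overload of torch.normal that takes scalar mean/std together with a size, which avoids building the full tensors first; scaling torch.randn is a version-agnostic alternative. A small sketch (not part of the original answer):
import torch
mean, std, size = 2.0, 10.0, (3, 3)
r1 = torch.normal(mean, std, size=size)   # scalar mean/std plus an explicit size (recent PyTorch)
r2 = mean + std * torch.randn(size)       # shift and scale a standard normal sample
print(r1)
print(r2)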
I understand that torch.distributions.Normal(loc, scale) is the class corresponding to a univariate normal distribution in PyTorch, and I understand how it works when loc and scale are numbers.
The problem is when the inputs to Normal are tensors as opposed to numbers. In that case I do not understand it well. What is the exact interpretation/usage of such tensor arguments?
See for example y_dist in the code below: loc and scale are tensors for y_dist. What does this mean exactly? I do not think this converts the univariate distribution to a multivariate one, does it? Does it instead form a group of univariate distributions?
import torch as pt
ptd = pt.distributions
x_dist = ptd.Normal(loc = 2, scale = 3)
x_samples = x_dist.sample()
batch_size = 256
y_dist = ptd.Normal(loc = 0.25 * pt.ones(batch_size, dtype=pt.float32), scale = pt.ones(batch_size, dtype=pt.float32))
As you said, if loc (a.k.a. mu) and scale (a.k.a. sigma) are floats then it will sample from a normal distribution with loc as the mean and scale as the standard deviation.
Providing tensors instead of floats will just make it sample from more than one normal distribution independently (unlike torch.distributions.MultivariateNormal, of course).
If you look at the source code you will see that loc and scale are broadcast to the same shape in __init__.
Here's an example to show this behavior:
>>> import torch
>>> from torch.distributions import Normal
>>> mu = torch.tensor([-10, 10], dtype=torch.float)
>>> sigma = torch.ones(2, 2)
>>> y_dist = Normal(loc=mu, scale=sigma)
Above mu is 1D, while sigma is 2D, yet:
>>> y_dist.loc
tensor([[-10., 10.],
[-10., 10.]])
So it will get two samples from N(-10, 1) and two samples from N(10, 1):
>>> y_dist.sample()
tensor([[ -9.1686, 10.6062],
[-10.0974, 8.5439]])
Similarly:
>>> mu = torch.zeros(2, 2)
>>> sigma = torch.tensor([0.001, 1000], dtype=torch.float)
>>> y_dist = Normal(loc=mu, scale=sigma)
Will broadcast scale to be:
>>> y_dist.scale
tensor([[1.0000e-03, 1.0000e+03],
        [1.0000e-03, 1.0000e+03]])
>>> y_dist.sample()
tensor([[-8.0329e-04, 1.4213e+01],
[-1.4907e-03, 3.1190e+02]])
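To make the "group of univariate distributions" reading concrete, you can also inspect the distribution's batch_shape and event_shape (a small sketch reusing the broadcasting example above):
import torch
from torch.distributions import Normal

mu = torch.zeros(2, 2)
sigma = torch.tensor([0.001, 1000.0])
y_dist = Normal(loc=mu, scale=sigma)

print(y_dist.batch_shape)         # torch.Size([2, 2]) -> a batch of 4 independent univariate normals
print(y_dist.event_shape)         # torch.Size([])     -> each individual distribution is scalar-valued
print(y_dist.sample((3,)).shape)  # torch.Size([3, 2, 2]); the sample_shape is prepended to the batch shape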
I am currently studying the Deep Learning specialization taught on Coursera by Andrew Ng. In the first assignment, I have to define a prediction function, and I wanted to know if my alternative solution is as valid as the actual solution.
Please let me know if my understanding of the np.where() function is correct, as I have commented on this in the code under "ALTERNATIVE SOLUTION COMMENTS". It would also be much appreciated if my understanding under "ACTUAL SOLUTION COMMENTS" could be checked as well.
The alternative solution that uses np.where() also works when I increase the number of examples/inputs in X from the current amount (m = 3) to 4, to 5, and so on.
Let me know what you think, and if both solutions are just as good as the other! Thanks.
def predict(w, b, X):
'''
Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
Arguments:
w -- weights, a numpy array of size (num_px * num_px * 3, 1)
b -- bias, a scalar
X -- data of size (num_px * num_px * 3, number of examples)
Returns:
Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
'''
m = X.shape[1]
Y_prediction = np.zeros((1,m)) # Initialize Y_prediction as an array of zeros
w = w.reshape(X.shape[0], 1)
# Compute vector "A" predicting the probabilities of a cat being present in the picture
### START CODE HERE ### (≈ 1 line of code)
A = sigmoid(np.dot(w.T, X) + b) # Note: The shape of A will always be a (1,m) row vector
### END CODE HERE ###
for i in range(A.shape[1]): # for i in range(# of examples in A = # of examples in our set)
# Convert probabilities A[0,i] to actual predictions p[0,i]
### START CODE HERE ### (≈ 4 lines of code)
Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0
'''
ACTUAL SOLUTION COMMENTS:
The above reads as:
Change/update the i-th value of Y_prediction to 1 if the corresponding i-th value in A is > 0.5.
Otherwise, change/update the i-th value of Y_prediction to 0.
'''
'''
ALTERNATIVE SOLUTION COMMENTS:
To condense this code, you could delete the for loop and Y_prediction var from the top,
and then use the following one line:
return np.where(A > 0.5, np.ones((1,m)), np.zeros((1,m)))
This reads as:
Given the condition > 0.5, return np.ones((1,m)) if True,
or return np.zeros((1,m)) if False.
Another way to understand this is as follows:
Tell me where in the array A, entries satisfies the condition A > 0.5,
At those positions, give me np.ones((1,m)), otherwise, give me
np.zeros((1,m))
'''
### END CODE HERE ###
assert(Y_prediction.shape == (1, m))
return Y_prediction
w = np.array([[0.1124579],[0.23106775]])
b = -0.3
X = np.array([[1.,-1.1,-3.2],[1.2,2.,0.1]])
print(sigmoid(np.dot(w.T, X) + b))
print ("predictions = " + str(predict(w, b, X))) # Output gives 1,1,0 as expected
Your alternative approach seems fine. As a remark, I'll add that you don't even need np.ones and np.zeros; you can just pass the integers 1 and 0 directly. When using np.where, as long as x and y (the values to choose between according to the condition) and the condition itself are broadcastable, it will work fine.
Here's a simple example:
y_pred = np.random.rand(1,6).round(2)
# array([[0.53, 0.54, 0.68, 0.34, 0.53, 0.46]])
np.where(y_pred > 0.5, np.ones((1,6)), np.zeros((1,6)))
# array([[1., 1., 1., 0., 1., 0.]])
And using integers:
np.where(y_pred > 0.5, 1, 0)
# array([[1, 1, 1, 0, 1, 0]])
As for your comments on how the function works: it is indeed working as you describe. Rather than just condensing the code, though, I'd argue that using numpy this way makes it more efficient, and also more intelligible in this case.
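Putting the two together, here is a minimal sketch of the loop-free version described in the question's comments (the name predict_vectorized and the local sigmoid definition are mine, not part of the assignment):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_vectorized(w, b, X):
    # Same contract as predict(), but without the explicit loop.
    A = sigmoid(np.dot(w.reshape(X.shape[0], 1).T, X) + b)  # shape (1, m)
    return np.where(A > 0.5, 1, 0)                          # shape (1, m), integer dtype

w = np.array([[0.1124579], [0.23106775]])
b = -0.3
X = np.array([[1., -1.1, -3.2], [1.2, 2., 0.1]])
print(predict_vectorized(w, b, X))  # [[1 1 0]]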
As far as I understand, for tf.layers.batch_normalization the axis I define is the axis that gets normalized.
Simply put:
Given these values
a = [[0, 2],
[1, 4]]
with shape (2, 2) and therefore axes 0 and 1.
Normalizing over axis 1 would mean computing the mean and standard deviation along axis 0 and then using these values for the normalization.
Therefore
bn = tf.layers.batch_normalization(a, axis=[1])
would have (nearly) the same result as
m, v = tf.nn.moments(a, axes=[0])
bn = (a - m) / tf.sqrt(v)
But how would I do tf.layers.batch_normalization over all axes?
With the mean and standard deviation calculation from before this would be easy:
m, v = tf.nn.moments(a, axes=[0, 1])
bn = (a - m) / tf.sqrt(v)
But how to do this with batch normalization?
bn = tf.layers.batch_normalization(a, axis=[???])
I tried the following, but none of it works:
axis = None: AttributeError: 'BatchNormalization' object has no attribute 'axis'
axis = []: IndexError: list index out of range
axis = [0, 1]: All results are zero
Unfortunately I don't think this is feasible using the batch_normalization layer/function of the TensorFlow API.
As the name of the function suggests, it is intended to perform "batch" normalization, i.e. to normalize the given feature axis using statistics computed over the current batch (usually dimension 0).
This can be achieved with layer normalization:
import numpy as np
import tensorflow as tf

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
layer = tf.keras.layers.LayerNormalization(axis=[0, 1])
output = layer(data)
print(output)
tf.Tensor(
[[-1.5666981 -1.2185429 ]
[-0.8703878 -0.5222327 ]
[-0.17407757 0.17407756]
[ 0.52223265 0.8703878 ]
[ 1.2185429 1.5666981 ]], shape=(5, 2), dtype=float32)
The difference from batch normalization is that layer normalization applies the operation to each sample separately, over the axes you specify, rather than using statistics computed over the batch.
If you do want to normalize over the batch, go for batch norm instead; it similarly accepts axis as a list.
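If you want to sanity-check that this matches the manual computation from the question, here is a small sketch (assuming TF 2.x with eager execution):
import numpy as np
import tensorflow as tf

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)

# Layer normalization over every axis of the tensor...
ln = tf.keras.layers.LayerNormalization(axis=[0, 1])(data)

# ...should match the manual moments-based calculation up to the layer's epsilon.
m, v = tf.nn.moments(data, axes=[0, 1])
manual = (data - m) / tf.sqrt(v)

print(np.allclose(ln.numpy(), manual.numpy(), atol=1e-3))  # True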
In the MNIST beginner tutorial, there is the statement
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
tf.cast basically changes the type of the tensor, but what is the difference between tf.reduce_mean and np.mean?
Here is the doc on tf.reduce_mean:
reduce_mean(input_tensor, reduction_indices=None, keep_dims=False, name=None)
input_tensor: The tensor to reduce. Should have numeric type.
reduction_indices: The dimensions to reduce. If None (the default), reduces all dimensions.
# 'x' is [[1., 1.],
#         [2., 2.]]
tf.reduce_mean(x) ==> 1.5
tf.reduce_mean(x, 0) ==> [1.5, 1.5]
tf.reduce_mean(x, 1) ==> [1., 2.]
For a 1D vector, it looks like np.mean == tf.reduce_mean, but I don't understand what's happening in tf.reduce_mean(x, 1) ==> [1., 2.]. tf.reduce_mean(x, 0) ==> [1.5, 1.5] kind of makes sense, since the mean of [1, 2] and [1, 2] is [1.5, 1.5], but what's going on with tf.reduce_mean(x, 1)?
The functionality of numpy.mean and tensorflow.reduce_mean is the same; they do the same thing, as you can see from the documentation for numpy and tensorflow. Let's look at an example:
import numpy as np
import tensorflow as tf

c = np.array([[3., 4], [5., 6], [6., 7]])
print(np.mean(c, 1))

Mean = tf.reduce_mean(c, 1)
with tf.Session() as sess:
    result = sess.run(Mean)
    print(result)
Output
[ 3.5 5.5 6.5]
[ 3.5 5.5 6.5]
Here you can see that when axis (numpy) or reduction_indices (tensorflow) is 1, it computes the mean across (3,4), (5,6) and (6,7), so 1 defines the axis across which the mean is computed. When it is 0, the mean is computed across (3,5,6) and (4,6,7), and so on. I hope you get the idea.
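For completeness, the same array reduced along axis 0 (tf.reduce_mean(c, 0) inside a session gives the same values):
print(np.mean(c, 0))  # [4.66666667 5.66666667] -> means of (3,5,6) and (4,6,7)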
Now what are the differences between them?
You can compute a numpy operation anywhere in Python, but a tensorflow operation must be run inside a tensorflow Session. You can read more about it here. So whenever you need to perform any computation for your tensorflow graph (or structure, if you will), it must be done inside a tensorflow Session.
Let's look at another example.
npMean = np.mean(c)
print(npMean+1)
tfMean = tf.reduce_mean(c)
Add = tfMean + 1
with tf.Session() as sess:
result = sess.run(Add)
print(result)
We could increase the mean by 1 in numpy as you naturally would, but in order to do it in tensorflow, you need to perform it inside a Session; without a Session you can't do that. In other words, when you write tfMean = tf.reduce_mean(c), tensorflow doesn't compute it at that point. It only computes it in a Session. But numpy computes it instantly, as soon as you call np.mean().
I hope it makes sense.
The key here is the word reduce, a concept from functional programming, which makes it possible for reduce_mean in TensorFlow to compute the mean of the results of computations over a whole batch of inputs.
If you are not familiar with functional programming, this can seem mysterious. So first let us see what reduce does. If you were given a list like [1,2,5,4] and were told to compute the mean, that is easy - just pass the whole array to np.mean and you get the mean. However what if you had to compute the mean of a stream of numbers? In that case, you would have to first assemble the array by reading from the stream and then call np.mean on the resulting array - you would have to write some more code.
An alternative is to use the reduce paradigm. As an example, look at how we can use reduce in python to calculate the sum of numbers:
from functools import reduce
reduce(lambda x, y: x + y, [1, 2, 5, 4])  # 12
It works like this:
Step 1: Read two numbers from the list, 1 and 2. Evaluate the lambda on (1, 2). reduce stores the result, 3. Note: this is the only step where two numbers are read off the list.
Step 2: Read the next number from the list, 5. Evaluate the lambda on (3, 5) (3 being the result from step 1 that reduce stored). reduce stores the result, 8.
Step 3: Read the next number from the list, 4. Evaluate the lambda on (8, 4) (8 being the result of step 2 that reduce stored). reduce stores the result, 12.
Step 4: There are no numbers left in the list, so return the stored result, 12.
Read more here: Functional Programming in Python.
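As a toy illustration of the "stream" point above (the stream is simulated with a generator here; this helper is my own example, not from the tutorial):
from functools import reduce

def number_stream():
    # Stand-in for numbers arriving one at a time (e.g. read from a file or a socket).
    yield from [1, 2, 5, 4]

# Fold a (count, total) pair over the stream, then divide once at the end.
count, total = reduce(lambda acc, x: (acc[0] + 1, acc[1] + x), number_stream(), (0, 0))
print(total / count)  # 3.0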
To see how this applies to TensorFlow, look at the following block of code, which defines a simple graph that takes in a float and computes the mean. The input to the graph, however, is not a single float but an array of floats. reduce_mean computes the mean value over all those floats.
import tensorflow as tf
inp = tf.placeholder(tf.float32)
mean = tf.reduce_mean(inp)
x = [1,2,3,4,5]
with tf.Session() as sess:
print(mean.eval(feed_dict={inp : x}))
This pattern comes in handy when computing values over batches of images. Look at The Deep MNIST Example where you see code like:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
The newer documentation states that tf.reduce_mean() produces the same results as np.mean:
Equivalent to np.mean
It also has essentially the same parameters as np.mean. But here is an important difference: they produce the same results only for float values:
import tensorflow as tf
import numpy as np
from random import randint
num_dims = 10
rand_dim = randint(0, num_dims - 1)
c = np.random.randint(50, size=tuple([5] * num_dims)).astype(float)
with tf.Session() as sess:
    r1 = sess.run(tf.reduce_mean(c, rand_dim))
r2 = np.mean(c, rand_dim)

is_equal = np.array_equal(r1, r2)
print(is_equal)
if not is_equal:
    print(r1)
    print(r2)
If you remove the type conversion, you will see different results.
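A minimal sketch of that dtype difference (TF 1.x session style, matching the code above): with integer input, numpy promotes the mean to float, while tf.reduce_mean keeps the input dtype and truncates.
import numpy as np
import tensorflow as tf

c_int = np.array([1, 2], dtype=np.int32)
print(np.mean(c_int))  # 1.5 -- numpy promotes the result to float

with tf.Session() as sess:
    print(sess.run(tf.reduce_mean(c_int)))  # 1 -- the output keeps the integer dtype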
In addition to this, many other tf.reduce_ functions such as reduce_all, reduce_any, reduce_min, reduce_max and reduce_prod produce the same values as their numpy analogs. Since they are TensorFlow operations, they can of course only be executed from inside a session.