I have a tensorflow expression where I want to use a different expression depending on whether I'm computing the forward or backward (gradient) pass. Specifically, I want to ignore the effects of some randomness (noise) added into the network during the backwards pass.
Here's a simplified example:
import numpy as np
import tensorflow as tf
x = tf.placeholder(tf.float32)
y = x**2
u = tf.random_uniform(tf.shape(x), minval=0.9, maxval=1.1)
yu = y * u
z = tf.sqrt(yu)
g = tf.gradients(z, x)[0]
with tf.Session() as sess:
    yv, yuv, zv, gv = sess.run([y, yu, z, g], {x: [-2, -1, 1]})
print(yv)
print(yuv)
print(zv)
print(gv)
which outputs something like
[4. 1. 1.]
[4.1626534 0.9370764 1.0806011]
[2.0402582 0.96802706 1.0395197 ]
[-1.0201291 -0.96802706 1.0395197 ]
The last values here are the derivative of z with respect to x. I would like them to not include the multiplicative noise term u, i.e. they should consistently be [-1, -1, 1] for these input values of x.
Is there a way to do such a thing only using Python? I know I can make a custom operator in C and define a custom gradient for it, but I'd like to avoid this if possible.
Also, I'm hoping to use this as part of a Keras layer, so a Keras-based solution would be an alternative (i.e. if one could define a different expression for the forwards and backwards pass through a Keras layer). This does mean that just defining a second expression z2 = tf.sqrt(y) and calling gradients on that isn't a solution for me, though, because I don't know how I would put that in Keras (since in Keras, it will be part of a very long computational graph).
The short answer is that Sergey Ioffe's trick, which you mentioned above, will only work if it's applied at the very end of the graph, right before the gradient computation.
I am assuming that you tried the following, which will not work:
yu_fixed = tf.stop_gradient(yu - y) + y
z = tf.sqrt(yu_fixed)
This still outputs random-tainted gradients.
To see why, let's follow along the gradient computation. Let's use s as shorthand for tf.stop_gradient. The way this works is that when TensorFlow needs to compute s(expr), it just returns expr, but when it needs to compute the gradient of s(expr), it returns 0.
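We want to compute the gradient of z = sqrt(s(yu - y) + y). Now, because
$$\frac{\partial}{\partial x}\sqrt{f(x)} = \frac{1}{2\sqrt{f(x)}}\,\frac{\partial f(x)}{\partial x},$$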
we find that the gradient of z contains both a term with the derivative of s(), but also a term containing s() itself. This latter term will not zero out the s() portion, so the computed derivative of z will depend (in some odd and incorrect way) on the value yu. This is why the above solution still contains randomness in the gradient.
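A quick numerical check of this argument in plain Python (no TensorFlow involved). This is only a sketch: the function name `grad_mid_graph_trick` is made up for illustration, and the formula is derived by hand from the chain rule, assuming s(...) contributes zero gradient so that the inner derivative is just d(y)/dx = 2x.

```python
import math

def grad_mid_graph_trick(x, u):
    """Hand-derived dz/dx for z = sqrt(s(yu - y) + y), with y = x**2 and yu = y * u.

    In the forward pass s is the identity, so the value under the sqrt is yu.
    In the backward pass only the "+ y" term contributes, giving 2 * x.
    """
    yu = x ** 2 * u
    return (2 * x) / (2 * math.sqrt(yu))

# Same input, two different noise samples: the gradient changes with u.
g_low = grad_mid_graph_trick(-2.0, 0.9)
g_high = grad_mid_graph_trick(-2.0, 1.1)
g_no_noise = grad_mid_graph_trick(-2.0, 1.0)
print(g_low, g_high, g_no_noise)
```

The expression simplifies to sign(x) / sqrt(u), so it matches the desired -1 or 1 only when u happens to be exactly 1.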
As far as I can see, the only way to work around this is to apply Ioffe's trick as the last stage before the call to tf.gradients. In other words, if you do something like the following, you will get the expected result:
x = tf.placeholder(tf.float32)
y = x**2
u = tf.random_uniform(tf.shape(x), minval=0.9, maxval=1.1)
yu = y * u
z = tf.sqrt(yu)
z_fixed = tf.stop_gradient(z - tf.sqrt(y)) + tf.sqrt(y)
g = tf.gradients(z_fixed, x)[0]
with tf.Session() as sess:
    yv, yuv, zv, gv = sess.run([y, yu, z_fixed, g], {x: [-2, -1, 1]})
print(yv)
print(yuv)
print(zv)
print(gv)
Output:
[ 4. 1. 1.]
[ 3.65438652 1.07519293 0.94398856]
[ 1.91164494 1.03691506 0.97159076]
[-1. -1. 1.]
I'm trying to test the layer normalization function of PyTorch.
But I don't know why b[0] and result have different values here.
Did I do something wrong?
import numpy as np
import torch
import torch.nn as nn
a = torch.randn(1, 5)
m = nn.LayerNorm(a.size()[1:], elementwise_affine= False)
b = m(a)
Result:
input: a[0] = tensor([-1.3549, 0.3857, 0.1110, -0.8456, 0.1486])
output: b[0] = tensor([-1.5561, 1.0386, 0.6291, -0.7967, 0.6851])
mean = torch.mean(a[0])
var = torch.var(a[0])
result = (a[0]-mean)/(torch.sqrt(var+1e-5))
Result:
result = tensor([-1.3918, 0.9289, 0.5627, -0.7126, 0.6128])
And, for n*2 normalization , the result of pytorch layer norm is always [1.0 , -1.0] (or [-1.0, 1.0]) . I can't understand why. Please let me know if you have any hints
a = torch.randn(1, 2)
m = nn.LayerNorm(a.size()[1:], elementwise_affine= False)
b = m(a)
Result:
b = tensor([-1.0000, 1.0000])
For calculating the variance use torch.var(a[0], unbiased=False). Then you will get the same result. By default pytorch calculates the unbiased estimation of the variance.
For your 1st question, as @Theodor said, you need to use unbiased=False when calculating the variance.
Only if you want to explore more: as your input size is 5, the unbiased estimate of the variance will be 5/4 = 1.25 times the biased estimate, because the unbiased estimate uses N-1 instead of N in the denominator. As a result, each value of result that you generated is sqrt(4/5) = 0.8944 times the corresponding value of b[0].
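The relationship can be checked without PyTorch, using only the standard library and the five input values printed in the question. This is a sketch, not a call into nn.LayerNorm: the names `layer_norm_like` and `manual_result` are made up here, and eps = 1e-5 is taken from the question's own formula.

```python
import math
import statistics

a = [-1.3549, 0.3857, 0.1110, -0.8456, 0.1486]  # a[0] as printed in the question
eps = 1e-5                                       # same epsilon as in the question's formula
mean = statistics.mean(a)

biased_var = statistics.pvariance(a)    # divides by N, like torch.var(..., unbiased=False)
unbiased_var = statistics.variance(a)   # divides by N - 1, like torch.var's default

layer_norm_like = [(v - mean) / math.sqrt(biased_var + eps) for v in a]
manual_result = [(v - mean) / math.sqrt(unbiased_var + eps) for v in a]
ratio = math.sqrt(biased_var / unbiased_var)     # close to sqrt(4/5)

print([round(v, 4) for v in layer_norm_like])
print([round(v, 4) for v in manual_result])
print(ratio)
```

Up to rounding of the printed inputs, the first list should line up with b[0] and the second with result from the question.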
About your 2nd question:
And, for n*2 normalization , the result of pytorch layer norm is always [1.0 , -1.0]
This is reasonable. Suppose the only two elements are a and b. Then the mean is (a+b)/2 and the variance is ((a-b)^2)/4, so sqrt(variance) = |a-b|/2. The normalization result is therefore [((a-b)/2) / sqrt(variance), ((b-a)/2) / sqrt(variance)], which is essentially [1, -1] or [-1, 1] depending on whether a > b or a < b.
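A small plain-Python sketch of the two-element case. The helper `normalize_pair` is hypothetical and written only for this illustration; it uses the biased variance and an eps of 1e-5, as in the question.

```python
import math

def normalize_pair(a, b, eps=1e-5):
    """Normalize two values with the biased variance, which equals (a - b)**2 / 4."""
    mean = (a + b) / 2
    var = ((a - mean) ** 2 + (b - mean) ** 2) / 2
    std = math.sqrt(var + eps)
    return [(a - mean) / std, (b - mean) / std]

pair_ascending = normalize_pair(0.3, 2.0)    # a < b
pair_descending = normalize_pair(5.0, -1.0)  # a > b
print(pair_ascending)
print(pair_descending)
```

Because of eps, the magnitudes are very slightly below 1, and they shrink further when a and b are nearly equal.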
I am currently studying the Deep Learning specialization taught on Coursera by Andrew Ng. In the first assignment, I have to define a prediction function, and wanted to know if my alternative solution is as valid as the actual solution.
Please let me know if my understanding of the np.where() function is correct as I have commented on this in the code under "ALTERNATIVE SOLUTION COMMENTS". Also, it would be much appreciated if my understanding under the "ACTUAL SOLUTION COMMENTS" could be checked as well.
The alternative solution that uses np.where() also works when I increase the number of examples/inputs in X from the current amount (m = 3) to 4, to 5, and so on.
Let me know what you think, and if both solutions are just as good as the other! Thanks.
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''
    m = X.shape[1]
    Y_prediction = np.zeros((1,m)) # Initialize Y_prediction as an array of zeros
    w = w.reshape(X.shape[0], 1)
    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    ### START CODE HERE ### (≈ 1 line of code)
    A = sigmoid(np.dot(w.T, X) + b) # Note: The shape of A will always be a (1,m) row vector
    ### END CODE HERE ###
    for i in range(A.shape[1]): # for i in range(# of examples in A = # of examples in our set)
        # Convert probabilities A[0,i] to actual predictions p[0,i]
        ### START CODE HERE ### (≈ 4 lines of code)
        Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0
        '''
        ACTUAL SOLUTION COMMENTS:
        The above reads as:
        Change/update the i-th value of Y_prediction to 1 if the corresponding i-th value in A is > 0.5.
        Otherwise, change/update the i-th value of Y_prediction to 0.
        '''
        '''
        ALTERNATIVE SOLUTION COMMENTS:
        To condense this code, you could delete the for loop and Y_prediction var from the top,
        and then use the following one line:
        return np.where(A > 0.5, np.ones((1,m)), np.zeros((1,m)))
        This reads as:
        Given the condition > 0.5, return np.ones((1,m)) if True,
        or return np.zeros((1,m)) if False.
        Another way to understand this is as follows:
        Tell me where in the array A, entries satisfies the condition A > 0.5,
        At those positions, give me np.ones((1,m)), otherwise, give me
        np.zeros((1,m))
        '''
        ### END CODE HERE ###
    assert(Y_prediction.shape == (1, m))
    return Y_prediction
w = np.array([[0.1124579],[0.23106775]])
b = -0.3
X = np.array([[1.,-1.1,-3.2],[1.2,2.,0.1]])
print(sigmoid(np.dot(w.T, X) + b))
print ("predictions = " + str(predict(w, b, X))) # Output gives 1,1,0 as expected
Your alternative approach seems fine. As a remark, I'll add that you don't even need np.ones and np.zeros; you can specify the integers 0 and 1 directly. When using np.where, as long as x and y (the values to choose from according to the condition) and the condition itself are broadcastable, it should work fine.
Here's a simple example:
y_pred = np.random.rand(1,6).round(2)
# array([[0.53, 0.54, 0.68, 0.34, 0.53, 0.46]])
np.where(y_pred> 0.5, np.ones((1,6)), np.zeros((1,6)))
# array([[1., 1., 1., 0., 1., 0.]])
And using integers:
np.where(y_pred> 0.5,1,0)
# array([[1, 1, 1, 0, 1, 0]])
As for your comments on how the function works, it is indeed working as you describe. Perhaps, instead of "To condense this code", I'd argue that using numpy makes it more efficient, and also more intelligible in this case.
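To make the broadcasting point concrete, here is a short NumPy sketch comparing three equivalent ways of thresholding the probabilities. The variable names are illustrative only, and the boolean-cast variant is an extra option not discussed in the question.

```python
import numpy as np

A = np.array([[0.53, 0.54, 0.68, 0.34, 0.53, 0.46]])  # values from the example above

with_arrays = np.where(A > 0.5, np.ones(A.shape), np.zeros(A.shape))
with_scalars = np.where(A > 0.5, 1.0, 0.0)  # scalars broadcast against the condition
with_cast = (A > 0.5).astype(float)         # boolean mask converted directly

print(with_scalars)
print(np.array_equal(with_arrays, with_scalars))
print(np.array_equal(with_scalars, with_cast))
```

All three keep the (1, m) shape that the assert in predict expects, so any of them could replace the loop.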
I want to replace all 0s in a 2-D tensor with -5.
With a DataFrame, I can easily do this:
df = df.mask(df == 0, -5)
but this does not work for tensors. I have tried:
y = torch.where(y = 0, -5, y)
It is simple, just use this:
y[y==0]=-5
There are two general ways.
One, given above by prhmma, is to use in-place mutation like y[y == 0] = -5. It is nice and efficient, but it will break autograd. So if you want gradients to flow through y, you should not do that.
The other is to use torch.where, as you have attempted. The proper incantation is
y = torch.where(y == 0, torch.tensor(-5), y)
or, if you want to be device- and dtype-agnostic
five = torch.tensor(-5, dtype=y.dtype, device=y.device)
y = torch.where(y == 0, five, y)
The fact that where does not accept scalars is an annoying papercut, but that's how it is ATM. Note that while the choice itself is discrete and obviously not differentiable, this operation will let gradients flow through both operands.
I have two tensors.
A tensor of shape (1,N)
A tensor of shape (N,T)
What I want to calculate is the following scalar:
tf.reduce_sum seemed helpful, but I couldn't get my head around combining the two tensors and the reduce functions to get what I want. Can someone help me write the above equation in tensorflow?
Does this work?
import tensorflow as tf
import numpy as np
N = 10
T = 20
l = tf.constant(np.random.randn(1, N), dtype=tf.float32)
z = tf.constant(np.random.randn(N, T), dtype=tf.float32)
with tf.Session() as sess:
    # swap axis for broadcasting to work
    l = tf.transpose(l, [1, 0])
    z_div_l = tf.divide(z, l)
    z_div_l_2 = tf.divide(1.0 - z, 1.0 - l)
    result = tf.reduce_sum(tf.add(z_div_l, z_div_l_2), axis=0)
    eval_result = sess.run(result)
    print('{}\n{}'.format(eval_result.shape, eval_result))
This calculates the above expression for every t from 0 to T-1, so it is not a scalar but a vector of size (T,). Your question mentions you want to compute just one scalar, but the sum is only over N and not over T, so I assumed you just want this expression to be evaluated for every t.
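The broadcasting step can be tried out in NumPy alone. The sketch below mirrors the TensorFlow code and then, as one possible reading of the question (the intended formula is not shown, so this is an assumption), sums the per-t values into a single scalar. Inputs are drawn strictly inside (0, 1) here so that neither l nor 1 - l is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 20
l = rng.uniform(0.1, 0.9, size=(1, N))
z = rng.uniform(0.1, 0.9, size=(N, T))

l_col = l.T  # shape (N, 1), so it broadcasts across the T columns of z
per_t = np.sum(z / l_col + (1.0 - z) / (1.0 - l_col), axis=0)  # shape (T,)
total = per_t.sum()  # a single scalar: summed over n and then over t

print(per_t.shape)
print(total)
```

Dropping axis=0 from the first sum would give the same scalar in one step, since summing without an axis reduces over every dimension.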
In the MNIST beginner tutorial, there is the statement
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
tf.cast basically changes the type of tensor the object is, but what is the difference between tf.reduce_mean and np.mean?
Here is the doc on tf.reduce_mean:
reduce_mean(input_tensor, reduction_indices=None, keep_dims=False, name=None)
input_tensor: The tensor to reduce. Should have numeric type.
reduction_indices: The dimensions to reduce. If None (the default), reduces all dimensions.
# 'x' is [[1., 1.]
#         [2., 2.]]
tf.reduce_mean(x) ==> 1.5
tf.reduce_mean(x, 0) ==> [1.5, 1.5]
tf.reduce_mean(x, 1) ==> [1., 2.]
For a 1D vector, it looks like np.mean == tf.reduce_mean, but I don't understand what's happening in tf.reduce_mean(x, 1) ==> [1., 2.]. tf.reduce_mean(x, 0) ==> [1.5, 1.5] kind of makes sense, since mean of [1, 2] and [1, 2] is [1.5, 1.5], but what's going on with tf.reduce_mean(x, 1)?
numpy.mean and tensorflow.reduce_mean have the same functionality. You can see this from the documentation for numpy and tensorflow. Let's look at an example:
c = np.array([[3.,4], [5.,6], [6.,7]])
print(np.mean(c,1))
Mean = tf.reduce_mean(c,1)
with tf.Session() as sess:
    result = sess.run(Mean)
    print(result)
Output
[ 3.5 5.5 6.5]
[ 3.5 5.5 6.5]
Here you can see that when axis (numpy) or reduction_indices (tensorflow) is 1, it computes the mean across (3,4), (5,6), and (6,7), so 1 defines the axis across which the mean is computed. When it is 0, the mean is computed across (3,5,6) and (4,6,7), and so on. I hope you get the idea.
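As a NumPy-only illustration of the axis rule, the following sketch runs the same array through each axis choice and prints the resulting shapes. The variable names are chosen here for illustration.

```python
import numpy as np

c = np.array([[3., 4.], [5., 6.], [6., 7.]])  # shape (3, 2)

row_means = np.mean(c, axis=1)  # one mean per row: (3,4), (5,6), (6,7)
col_means = np.mean(c, axis=0)  # one mean per column: (3,5,6), (4,6,7)
overall = np.mean(c)            # no axis given: mean over all six values

print(row_means, row_means.shape)
print(col_means, col_means.shape)
print(overall)
```

The axis that is named is the one that disappears from the shape: (3, 2) becomes (3,) for axis=1 and (2,) for axis=0.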
Now what are the differences between them?
You can run a numpy operation anywhere in Python. But a tensorflow operation must be run inside a tensorflow Session. You can read more about it here. So whenever you need to perform a computation for your tensorflow graph (or structure, if you will), it must be done inside a tensorflow Session.
Let's look at another example.
npMean = np.mean(c)
print(npMean+1)
tfMean = tf.reduce_mean(c)
Add = tfMean + 1
with tf.Session() as sess:
    result = sess.run(Add)
    print(result)
We can increase the mean by 1 in numpy as you naturally would, but to do it in tensorflow you need to perform it in a Session; without a Session you can't. In other words, when you write tfMean = tf.reduce_mean(c), tensorflow doesn't compute it at that point; it only computes it in a Session. numpy, by contrast, computes the result immediately when you write np.mean().
I hope it makes sense.
The key here is the word reduce, a concept from functional programming, which makes it possible for reduce_mean in TensorFlow to keep a running average of the results of computations from a batch of inputs.
If you are not familiar with functional programming, this can seem mysterious. So first let us see what reduce does. If you were given a list like [1,2,5,4] and were told to compute the mean, that is easy - just pass the whole array to np.mean and you get the mean. However what if you had to compute the mean of a stream of numbers? In that case, you would have to first assemble the array by reading from the stream and then call np.mean on the resulting array - you would have to write some more code.
An alternative is to use the reduce paradigm. As an example, look at how we can use reduce in python to calculate the sum of numbers:
reduce(lambda x,y: x+y, [1,2,5,4]).
It works like this:
Step 1: Read 2 digits from the list - 1,2. Evaluate lambda 1,2. reduce stores the result 3. Note - this is the only step where 2 digits are read off the list
Step 2: Read the next digit from the list - 5. Evaluate lambda 3, 5 (3 being the result from step 1, that reduce stored). reduce stores the result 8.
Step 3: Read the next digit from the list - 4. Evaluate lambda 8,4 (8 being the result of step 2, that reduce stored). reduce stores the result 12
Step 4: Read the next digit from the list - there are none, so return the stored result of 12.
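The steps can be traced in code. In Python 3, reduce lives in functools; the helper `add_with_trace` below is a made-up name that prints the stored value and the next element at each step before returning their sum.

```python
from functools import reduce

def add_with_trace(acc, item):
    # acc is the value reduce has stored so far; item is the next element read.
    print(f"stored={acc}, next={item} -> {acc + item}")
    return acc + item

total = reduce(add_with_trace, [1, 2, 5, 4])
print(total)
```

The function is called three times for four elements, matching steps 1 to 3; step 4 is simply reduce returning the stored value.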
Read more here Functional Programming in Python
To see how this applies to TensorFlow, look at the following block of code, which defines a simple graph, that takes in a float and computes the mean. The input to the graph however is not a single float but an array of floats. The reduce_mean computes the mean value over all those floats.
import tensorflow as tf
inp = tf.placeholder(tf.float32)
mean = tf.reduce_mean(inp)
x = [1,2,3,4,5]
with tf.Session() as sess:
    print(mean.eval(feed_dict={inp : x}))
This pattern comes in handy when computing values over batches of images. Look at The Deep MNIST Example where you see code like:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
The new documentation states that tf.reduce_mean() produces the same results as np.mean:
Equivalent to np.mean
It also has absolutely the same parameters as np.mean. But here is an important difference: they produce the same results only on float values:
import tensorflow as tf
import numpy as np
from random import randint
num_dims = 10
rand_dim = randint(0, num_dims - 1)
c = np.random.randint(50, size=tuple([5] * num_dims)).astype(float)
with tf.Session() as sess:
    r1 = sess.run(tf.reduce_mean(c, rand_dim))
r2 = np.mean(c, rand_dim)
is_equal = np.array_equal(r1, r2)
print(is_equal)
if not is_equal:
    print(r1)
    print(r2)
If you remove the type conversion (the .astype(float) call), you will see different results.
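The TensorFlow documentation for tf.reduce_mean notes that the output type follows the input type, so the mean of the integer tensor [1, 0, 1, 0] comes out as 0 rather than 0.5. The NumPy-only sketch below imitates that with integer division; the `int_style_mean` line is a stand-in for that behavior under this assumption, not a call into TensorFlow.

```python
import numpy as np

c_int = np.array([1, 0, 1, 0])

numpy_mean = np.mean(c_int)                    # NumPy promotes integer input to float64
int_style_mean = np.sum(c_int) // c_int.size   # integer arithmetic: the fraction is dropped

print(numpy_mean, numpy_mean.dtype)
print(int_style_mean)
```

Casting to float first, as the snippet above does with .astype(float), removes the discrepancy.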
In addition to this, many other tf.reduce_ functions such as reduce_all, reduce_any, reduce_min, reduce_max, and reduce_prod produce the same values as their numpy analogs. Because they are operations, they can be executed only from inside a session.