Suppose I multiply a vector with a scalar, e.g.:
a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])
with tf.GradientTape() as tape:
    c = a*b
grad = tape.gradient(c, a)
The resulting gradient I get is a scalar,
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
whereas we would expect the vector:
<tf.Variable 'Variable:0' shape=(3,) dtype=float32, numpy=array([1., 0., 1.], dtype=float32)>
Looking at other examples, it appears that TensorFlow sums the elements of the expected vector; the same happens for scalar-matrix multiplication and so on.
Why does TensorFlow do this? This could probably be avoided with tf.custom_gradient; is there a less cumbersome way to get the correct gradient?
There appear to be some related questions, but these all seem to consider the gradient of a loss function that aggregates over a training batch. No loss function or aggregation is used here, so I think the issue is something else.
You're getting a scalar value because you took the gradient with respect to a scalar. You would get a vector if you took the gradient with respect to a vector. Take a look at the following example:
import tensorflow as tf
a = tf.Variable(3., trainable=True)
b = tf.Variable([1., 0, 1.], trainable=True)
c = tf.Variable(2., trainable=True)
d = tf.Variable([2., 1, 2.], trainable=True)
with tf.GradientTape(persistent=True) as tape:
    e = a*b*c*d  # element-wise: e_i = a * b_i * c * d_i
    tf.print(e)

grad = tape.gradient(e, [a, b, c, d])
grad[0].numpy(), grad[1].numpy(), grad[2].numpy(), grad[3].numpy()
[12 0 12]
(8.0,
array([12., 6., 12.], dtype=float32),
12.0,
array([6., 0., 6.], dtype=float32))
Formally, what I was looking for is the differential of the vector field that is a function of the variable a. For a vector field, the differential is the same as the Jacobian. It turns out that what I was looking for can be computed with tape.jacobian.
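For reference, a minimal sketch of that approach, repeating the setup from the question:

import tensorflow as tf

a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])

with tf.GradientTape() as tape:
    c = a * b

# tape.gradient(c, a) sums the per-element partials dc_i/da into a scalar;
# tape.jacobian keeps one partial derivative per element of c instead.
print(tape.jacobian(c, a))
# tf.Tensor([1. 0. 1.], shape=(3,), dtype=float32)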
I'm trying to reimplement the Categorical Cross Entropy loss from Keras so that I can customize it. I got the following:
# imports for the internal TF helpers used below (module layout of recent TF 2.x)
from tensorflow.python.framework import ops
from tensorflow.python.ops import clip_ops, math_ops
from tensorflow.python.keras.backend import epsilon, _constant_to_tensor

def CustomCrossEntropy(output, target, axis=-1):
    target = ops.convert_to_tensor_v2_with_dispatch(target)
    output = ops.convert_to_tensor_v2_with_dispatch(output)
    target.shape.assert_is_compatible_with(output.shape)
    # scale preds so that the class probas of each sample sum to 1
    output = output / math_ops.reduce_sum(output, axis, True)
    # Compute cross entropy from probabilities.
    epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
    output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
    return -math_ops.reduce_sum(target * math_ops.log(output), axis)
It produces different results than the internal function, which confuses me, as so far I have just copied the code from GitHub. What am I missing here?
Proof:
y_true = [[0., 1., 0.], [0., 0., 1.]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
customLoss = CustomCrossEntropy(y_true, y_pred)
assert loss.shape == (2,)
print(loss)
print(customLoss)
>>tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
>>tf.Tensor([ 0.8059049 14.506287 ], shape=(2,), dtype=float32)
You have inverted the arguments of the function in your definition of CustomCrossEntropy. If you double-check the source code on GitHub, you will see that the first argument is target and the second one is output. If you switch them back, you will get the expected results.
import tensorflow as tf
from tensorflow.keras.backend import categorical_crossentropy as CustomCrossEntropy
y_true = [[0., 1., 0.], [0., 0., 1.]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
y_true = tf.convert_to_tensor(y_true)
y_pred = tf.convert_to_tensor(y_pred)
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print(loss)
# tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
loss = CustomCrossEntropy(y_true, y_pred)
print(loss)
# tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
loss = CustomCrossEntropy(y_pred, y_true)
print(loss)
# tf.Tensor([ 0.8059049 14.506287 ], shape=(2,), dtype=float32)
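For what it's worth, here is a sketch of the same computation written only against the public TF API, so it does not depend on the internal modules; the clip value 1e-7 is an assumption standing in for Keras' epsilon():

import tensorflow as tf

def custom_categorical_crossentropy(target, output, axis=-1, eps=1e-7):
    # Note the argument order: target first, output second, matching the backend function.
    target = tf.convert_to_tensor(target)
    output = tf.convert_to_tensor(output)
    # Scale predictions so the class probabilities of each sample sum to 1.
    output = output / tf.reduce_sum(output, axis, keepdims=True)
    # Clip to avoid log(0); eps stands in for the Keras backend epsilon.
    output = tf.clip_by_value(output, eps, 1. - eps)
    return -tf.reduce_sum(target * tf.math.log(output), axis)

y_true = [[0., 1., 0.], [0., 0., 1.]]
y_pred = [[0.05, 0.95, 0.], [0.1, 0.8, 0.1]]
print(custom_categorical_crossentropy(y_true, y_pred))
# should roughly match tf.keras.losses.categorical_crossentropy(y_true, y_pred)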
I'm still working on my understanding of the PyTorch autograd system. One thing I'm struggling at is to understand why .clamp(min=0) and nn.functional.relu() seem to have different backward passes.
It's especially confusing as .clamp is used equivalently to relu in PyTorch tutorials, such as https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-nn.
I found this when analysing the gradients of a simple fully connected net with one hidden layer and a ReLU activation (linear in the output layer).
To my understanding, the output of the following code should be all zeros. I hope someone can show me what I am missing.
import torch
dtype = torch.float
x = torch.tensor([[3, 2, 1],
                  [1, 0, 2],
                  [4, 1, 2],
                  [0, 0, 1]], dtype=dtype)
y = torch.ones(4, 4)

w1_a = torch.tensor([[1, 2],
                     [0, 1],
                     [4, 0]], dtype=dtype, requires_grad=True)
w1_b = w1_a.clone().detach()
w1_b.requires_grad = True

w2_a = torch.tensor([[-1, 1],
                     [-2, 3]], dtype=dtype, requires_grad=True)
w2_b = w2_a.clone().detach()
w2_b.requires_grad = True
y_hat_a = torch.nn.functional.relu(x.mm(w1_a)).mm(w2_a)
y_a = torch.ones_like(y_hat_a)
y_hat_b = x.mm(w1_b).clamp(min=0).mm(w2_b)
y_b = torch.ones_like(y_hat_b)
loss_a = (y_hat_a - y_a).pow(2).sum()
loss_b = (y_hat_b - y_b).pow(2).sum()
loss_a.backward()
loss_b.backward()
print(w1_a.grad - w1_b.grad)
print(w2_a.grad - w2_b.grad)
# OUT:
# tensor([[ 0., 0.],
# [ 0., 0.],
# [ 0., -38.]])
# tensor([[0., 0.],
# [0., 0.]])
#
The reason is that clamp and relu produce different gradients at 0. Check with a scalar tensor x = 0 by comparing the two versions: (x.clamp(min=0) - 1.0).pow(2).backward() versus (relu(x) - 1.0).pow(2).backward(). The resulting x.grad is 0 for the relu version but -2 for the clamp version. That means relu chooses x == 0 --> grad = 0, while clamp chooses x == 0 --> grad = 1.
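A minimal sketch of that check (the squared-error term is just an arbitrary function giving a nonzero upstream gradient at x == 0):

import torch

# relu: the subgradient 0 is chosen at x == 0
x = torch.tensor(0., requires_grad=True)
(torch.nn.functional.relu(x) - 1.0).pow(2).backward()
print(x.grad)  # tensor(0.)

# clamp: the gradient 1 is chosen at x == 0, so the upstream gradient -2 passes through
x = torch.tensor(0., requires_grad=True)
(x.clamp(min=0) - 1.0).pow(2).backward()
print(x.grad)  # tensor(-2.)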
When I use the assign method of tf.Variable to change the value of a variable, it breaks the gradient computation with tf.GradientTape, e.g., see the code for a toy example below:
(NOTE: I am interested in TensorFlow 2 only.)
x = tf.Variable([[2.0,3.0,4.0], [1.,10.,100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])
with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
    o = tf.reduce_mean(y)
do_dpatch = g.gradient(o, patch)
Then it gives me None for do_dpatch.
Note that if I do the following it works perfectly fine:
x = tf.Variable([[2.0,3.0,4.0], [1.,10.,100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])
with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
    o = tf.reduce_mean(y)
do_dx = g.gradient(o, x)
and gives me:
>>>do_dx
<tf.Tensor: id=106, shape=(2, 3), dtype=float32, numpy=
array([[ 1., 2., 52.],
[ 1., 2., 52.]], dtype=float32)>
This behavior does make sense. Let's take your first example:
x = tf.Variable([[2.0,3.0,4.0], [1.,10.,100.]])
patch = tf.Variable([[1., 1.], [1., 1.]])
with tf.GradientTape() as g:
    g.watch(patch)
    x[:2, :2].assign(patch)
    y = tf.tensordot(x, tf.transpose(x), axes=1)
dy_dx = g.gradient(y, patch)
You are computing dy/d(patch), but your y depends only on x, not on patch. Yes, you do assign values to x from patch, but this operation doesn't carry a reference to the patch Variable; it just copies the values.
In short, you are trying to get a gradient with respect to something that y doesn't depend on, so you get None.
Let's look at the second example and why it works.
x = tf.Variable([[2.0,3.0,4.0], [1.,10.,100.]])
with tf.GradientTape() as g:
    g.watch(x)
    x[:2, :2].assign([[1., 1.], [1., 1.]])
    y = tf.tensordot(x, tf.transpose(x), axes=1)
dy_dx = g.gradient(y, x)
This example is perfectly fine: y depends on x and you are computing dy/dx, so you get actual gradients in this example.
As explained HERE (see the quote below from alextp), tf.assign does not support gradients.
"There is no plan to add a gradient to tf.assign because it's not possible in general to connect the uses of the assigned variable with the graph which assigned it."
So, the above problem can be resolved by the following code:
x = tf.Variable([[0.0, 0.0, 4.0], [0., 0., 100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])
with tf.GradientTape() as g:
    g.watch(patch)
    padding = tf.constant([[0, 0], [0, 1]])
    padded_patch = tf.pad(patch, padding, mode='CONSTANT', constant_values=0)
    revised_x = x + padded_patch
    y = tf.tensordot(revised_x, tf.transpose(revised_x), axes=1)
    o = tf.reduce_mean(y)
do_dpatch = g.gradient(o, patch)
which results in
do_dpatch
<tf.Tensor: id=65, shape=(2, 2), dtype=float32, numpy=
array([[1., 2.],
[1., 2.]], dtype=float32)>
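As an aside (not part of the original answer), tf.tensor_scatter_nd_update should give the same result without the manual padding, since its gradient flows back to the scattered updates:

import tensorflow as tf

x = tf.Variable([[2.0, 3.0, 4.0], [1., 10., 100.]])
patch = tf.Variable([[0., 1.], [2., 3.]])

with tf.GradientTape() as g:
    # Build a new tensor with the patch scattered into the top-left 2x2 block,
    # instead of assigning in place; the result stays differentiable w.r.t. patch.
    indices = tf.constant([[0, 0], [0, 1], [1, 0], [1, 1]])
    revised_x = tf.tensor_scatter_nd_update(x, indices, tf.reshape(patch, [-1]))
    y = tf.tensordot(revised_x, tf.transpose(revised_x), axes=1)
    o = tf.reduce_mean(y)
do_dpatch = g.gradient(o, patch)
# expected to match the [[1., 2.], [1., 2.]] gradient shown above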
Problem description
I have inputs x that are indicator variables, and outputs y, where each row is a random one-hot vector that depends on the values of x (data sample shown below).
I want to train a model that essentially learns the probabilistic relationship between x and y in the form of per-column weights. The model must "choose" one, and only one, indicator to output. My current approach is to sample a categorical random variable and produce a one-hot vector as a prediction.
The issue is that I'm getting an error ValueError: An operation has `None` for gradient when I try to train my Keras model.
I find this error odd, because I've trained mixture networks using Keras and TensorFlow, which use tf.contrib.distributions.Categorical, and I did not run into any gradient-related issues.
Code
Experiment
import tensorflow as tf
import tensorflow.contrib.distributions as tfd
import numpy as np
from keras import backend as K
from keras.layers import Layer
from keras.models import Sequential
from keras.utils import to_categorical
def make_xy_prob(rng, size=10000):
    rng = np.random.RandomState(rng) if isinstance(rng, int) else rng
    cols = 3
    weights = np.array([[1, 2, 3]])

    # generate data and drop zeros for now
    x = rng.choice(2, (size, cols))
    is_zeros = x.sum(axis=1) == 0
    x = x[~is_zeros]

    # use weights to create probabilities for determining y
    weighted_x = x * weights
    prob_x = weighted_x / weighted_x.sum(axis=1, keepdims=True)
    y = np.row_stack([to_categorical(rng.choice(cols, p=p), cols) for p in prob_x])

    # add zeros back and shuffle
    zeros = np.zeros(((size - len(x), cols)))
    x = np.row_stack([x, zeros])
    y = np.row_stack([y, zeros])
    shuffle_idx = rng.permutation(size)
    x = x[shuffle_idx]
    y = y[shuffle_idx]
    return x, y


class OneHotGate(Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel', shape=(1, input_shape[1]), initializer='ones')

    def call(self, x):
        zero_cond = x < 1
        x_shape = tf.shape(x)

        # weight indicators so that more probability is assigned to more likely columns
        weighted_x = x * self.kernel

        # fill zeros with -inf so that zero probability is assigned to that column
        ninf_fill = tf.fill(x_shape, -np.inf)
        masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
        onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))

        # fill gate with zeros where input was originally zero
        zeros_fill = tf.fill(x_shape, 0.0)
        masked_gate = tf.where(zero_cond, zeros_fill, onehot_gate)
        return masked_gate


def experiment(epochs=10):
    K.clear_session()
    rng = np.random.RandomState(2)

    X, y = make_xy_prob(rng)
    input_shape = (X.shape[1], )

    model = Sequential()
    gate_layer = OneHotGate(input_shape=input_shape)
    model.add(gate_layer)
    model.compile('adam', 'categorical_crossentropy')
    model.fit(X, y, 64, epochs, verbose=1)
Data sample
>>> x
array([[1., 1., 1.],
[0., 1., 0.],
[1., 0., 1.],
...,
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 0.]])
>>> y
array([[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
...,
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.]])
Error
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
The problem lies in the fact that OneHotCategorical performs discontinuous sampling, which causes gradient computation to fail. In order to replace this discontinuous sampling with a continuous (relaxed) version, one may try RelaxedOneHotCategorical (which is based on the interesting Gumbel-Softmax technique).
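A minimal sketch (using the same tf.contrib.distributions namespace as the question; the temperature of 0.5 and the toy loss are arbitrary choices of mine) showing that the relaxed sample does carry gradients back to the logits:

import tensorflow as tf
import tensorflow.contrib.distributions as tfd

logits = tf.Variable([1.0, 2.0, 3.0])
# Lower temperature -> samples closer to one-hot; higher -> smoother samples.
relaxed_sample = tfd.RelaxedOneHotCategorical(temperature=0.5, logits=logits).sample()
toy_loss = tf.reduce_sum(relaxed_sample * tf.constant([1., 0., 0.]))
grads = tf.gradients(toy_loss, logits)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([relaxed_sample, grads]))  # grads is a real tensor, not None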
I want to feed a batch_size integer as a placeholder in TensorFlow, but it does not act as an integer. Consider the following example:
import tensorflow as tf
max_length = 5
batch_size = 3
batch_size_placeholder = tf.placeholder(dtype=tf.int32)
mask_0 = tf.one_hot(indices=[0]*batch_size_placeholder, depth=max_length, on_value=0., off_value=1.)
mask_1 = tf.one_hot(indices=[0]*batch_size, depth=max_length, on_value=0., off_value=1.)
# new session
with tf.Session() as sess:
    feed = {batch_size_placeholder: 3}
    batch, mask0, mask1 = sess.run([
        batch_size_placeholder, mask_0, mask_1
    ], feed_dict=feed)
When I print the values of batch, mask0 and mask1, I get the following:
print(batch)
>>> array(3, dtype=int32)
print(mask0)
>>> array([[0., 1., 1., 1., 1.]], dtype=float32)
print(mask1)
>>> array([[0., 1., 1., 1., 1.],
[0., 1., 1., 1., 1.],
[0., 1., 1., 1., 1.]], dtype=float32)
Indeed, I thought mask0 and mask1 should be the same, but it seems that TensorFlow does not treat batch_size_placeholder as an integer. I believe it is a tensor; is there any way that I can use it as an integer in my computations?
Is there any way I can fix this problem? Just FYI, I used tf.one_hot only as an example; I want to run training and validation in my code, where I will need many other computations with different batch_size values in the training and validation steps.
Any help would be appreciated.
In pure Python, [0]*3 is [0, 0, 0]. However, batch_size_placeholder is a placeholder; during graph execution it is a tensor, and [0]*tensor is treated as tensor multiplication. In your case, it yields a 1-D tensor containing a single 0. To use batch_size_placeholder correctly, you should create a tensor whose length equals batch_size_placeholder.
mask_0 = tf.one_hot(tf.zeros(batch_size_placeholder, dtype=tf.int32), depth=max_length, on_value=0., off_value=1.)
It will have the same result as mask_1.
A simple example to show the difference.
batch_size_placeholder = tf.placeholder(dtype=tf.int32)
a = [0]*batch_size_placeholder
b = tf.zeros(batch_size_placeholder, dtype=tf.int32)
with tf.Session() as sess:
    print(sess.run([a, b], feed_dict={batch_size_placeholder: 3}))
    # [array([0], dtype=int32), array([0, 0, 0], dtype=int32)]
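And, re-running the proposed fix in a session, mask_0 should now match mask_1 from the question:

import tensorflow as tf

max_length = 5
batch_size_placeholder = tf.placeholder(dtype=tf.int32)

# tf.zeros builds an index tensor whose length is the fed batch size,
# so the one-hot mask gets one row per sample.
mask_0 = tf.one_hot(tf.zeros(batch_size_placeholder, dtype=tf.int32),
                    depth=max_length, on_value=0., off_value=1.)

with tf.Session() as sess:
    print(sess.run(mask_0, feed_dict={batch_size_placeholder: 3}))
    # [[0. 1. 1. 1. 1.]
    #  [0. 1. 1. 1. 1.]
    #  [0. 1. 1. 1. 1.]]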