Keras Categorical_crossentropy loss implementation - python

I'm trying to reimplement the categorical cross-entropy loss from Keras so that I can customize it. So far I have the following:
def CustomCrossEntropy(output, target, axis=-1):
    target = ops.convert_to_tensor_v2_with_dispatch(target)
    output = ops.convert_to_tensor_v2_with_dispatch(output)
    target.shape.assert_is_compatible_with(output.shape)
    # scale preds so that the class probas of each sample sum to 1
    output = output / math_ops.reduce_sum(output, axis, True)
    # Compute cross entropy from probabilities.
    epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
    output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
    return -math_ops.reduce_sum(target * math_ops.log(output), axis)
It produces different results than the internal function, which confuses me, since so far I have just copied the code from GitHub. What am I missing here?
Demonstration:
y_true = [[0., 1., 0.], [0., 0., 1.]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
customLoss = CustomCrossEntropy(y_true, y_pred)
assert loss.shape == (2,)
print(loss)
print(customLoss)
>>tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
>>tf.Tensor([ 0.8059049 14.506287 ], shape=(2,), dtype=float32)

You have inverted the arguments of the function in your definition of CustomCrossEntropy. If you double-check the source code on GitHub, you will see that the first argument is target and the second one is output. If you switch them back, you will get the expected results.
import tensorflow as tf
from tensorflow.keras.backend import categorical_crossentropy as CustomCrossEntropy
y_true = [[0., 1., 0.], [0., 0., 1.]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
y_true = tf.convert_to_tensor(y_true)
y_pred = tf.convert_to_tensor(y_pred)
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print(loss)
# tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
loss = CustomCrossEntropy(y_true, y_pred)
print(loss)
# tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)
loss = CustomCrossEntropy(y_pred, y_true)
print(loss)
# tf.Tensor([ 0.8059049 14.506287 ], shape=(2,), dtype=float32)
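Equivalently, you can keep your own implementation and just match the Keras signature, where target comes first. A minimal sketch of the corrected function, using public TF ops in place of the private backend helpers (tf.clip_by_value and tf.reduce_sum, with 1e-7 standing in for keras.backend.epsilon()):
import tensorflow as tf

def CustomCrossEntropy(target, output, axis=-1):  # target first, matching the Keras source
    target = tf.convert_to_tensor(target)
    output = tf.convert_to_tensor(output)
    # scale preds so that the class probabilities of each sample sum to 1
    output = output / tf.reduce_sum(output, axis, True)
    # clip to avoid log(0); 1e-7 mirrors the default keras.backend.epsilon()
    epsilon_ = tf.constant(1e-7, dtype=output.dtype)
    output = tf.clip_by_value(output, epsilon_, 1. - epsilon_)
    return -tf.reduce_sum(target * tf.math.log(output), axis)

print(CustomCrossEntropy(y_true, y_pred))
# tf.Tensor([0.05129331 2.3025851 ], shape=(2,), dtype=float32)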

Related

How to use 'Keras symbolic inputs' with 'tf.while_loop'?

I'm trying to create an N x N tensor using tf.while_loop in my custom Keras layer. Here, N (timesteps in the code) is a Keras symbolic tensor (integer scalar). The code below is the __call__ method of my custom Keras layer in a Functional model.
import tensorflow as tf
from keras import backend as K

# timesteps = tf.constant(7)  ## This makes this code work!!
timesteps = K.shape(inputs)[1]  ## Or equivalently provided by timesteps = keras.layers.Input(shape=(), batch_size=1, name="timesteps")
# timesteps = tf.convert_to_tensor(timesteps)  ## Does not work.
idx_outer = tf.constant(0)
timesteps_mixed_outer = tf.reshape(tf.Variable([]), (0, timesteps))
# timesteps_mixed_outer = Lambda(lambda timesteps: tf.reshape(tf.Variable([]), (0, timesteps)))(timesteps)  ## Does not work

def body_inner(idx_inner, idx_outer, timesteps_mixed_inner):
    timesteps_mixed_inner = tf.concat([timesteps_mixed_inner, [tf.cond(idx_inner == idx_outer, lambda: True, lambda: False)]], axis=0)
    return idx_inner + 1, idx_outer, timesteps_mixed_inner

def body_outer(idx_outer, timesteps_mixed_outer):
    timesteps_mixed_inner = tf.Variable([])
    idx_inner = tf.constant(0)
    idx_inner, idx_outer, timesteps_mixed_inner = tf.while_loop(
        lambda idx_inner, idx_outer, timesteps_mixed_inner: K.less(idx_inner, timesteps),
        body_inner,
        [idx_inner, idx_outer, timesteps_mixed_inner],
        shape_invariants=[idx_inner.get_shape(), idx_outer.get_shape(), tf.TensorShape([None])])
    timesteps_mixed_outer = tf.concat([timesteps_mixed_outer, [timesteps_mixed_inner]], axis=0)
    return idx_outer + 1, timesteps_mixed_outer

idx_outer, timesteps_mixed_outer = tf.while_loop(
    lambda idx_outer, timesteps_mixed_outer: K.less(idx_outer, timesteps),
    body_outer,
    [idx_outer, timesteps_mixed_outer],
    shape_invariants=[idx_outer.get_shape(), tf.TensorShape([None, None])])  ## Here raises error
The last line of the above code raises the following error:
Exception has occurred: TypeError
Could not build a TypeSpec for <KerasTensor: shape=(0, None) dtype=float32 (created by layer 'tf.reshape')> with type KerasTensor
What I have tried:
I suspected the problem came from the Keras symbolic tensor input timesteps, so I changed it to timesteps = tf.constant(7) as an experiment. Then the code works and timesteps_mixed_outer has the desired values:
<tf.Tensor: shape=(7, 7), dtype=float32, numpy=
array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1.]], dtype=float32)>
I suspected the problem came from the use of the Keras symbolic tensor timesteps in the tf.reshape function, so I initialized timesteps_mixed_outer = tf.reshape(tf.Variable([]), (0, 7)) and left timesteps = K.shape(inputs)[1]. Then a new error occurs:
Exception has occurred: TypeError
Keras symbolic inputs/outputs do not implement `__len__`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model. This error will also get raised if you try asserting a symbolic input/output directly.
I have also tried wrapping tf.reshape following the two solutions suggested in "TypeError: Could not build a TypeSpec for <KerasTensor> when using tf.map_fn and keras functional model", but both raise the same error.
My environment is as follows:
MacOS 12.0.1
Python 3.7.3
keras-preprocessing [installed: 1.1.2]
keras.__version__ == 2.4.3
tensorflow [installed: 2.4.1]
tensorflow-estimator [installed: 2.4.0]
EDIT
This error is raised when I build the Keras model, before feeding actual NumPy values. timesteps = K.shape(inputs)[1] varies across inputs, so it is set to None, like a batch dimension.
timesteps = K.shape(inputs)[1]
== <KerasTensor: shape=() dtype=int32 inferred_value=[None] (created by layer 'tf.__operators__.getitem_6')>
with the following attributes:
dtype:tf.int32
is_tensor_like:True
name:'tf.__operators__.getitem_6/strided_slice:0'
op:'Traceback (most recent call last):\n File "/Users/imgspoints/.vscode/extensions/ms-python.python-2022.2.1924087327/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_resolver.py", line 193, in _get_py_dictionary\n attr = getattr(var, name)\n File "/Users/imgspoints/.local/share/virtualenvs/experiments-m6CLaaa4/lib/python3.7/site-packages/tensorflow/python/keras/engine/keras_tensor.py", line 251, in op\n raise TypeError(\'Keras symbolic inputs/outputs do not \'\nTypeError: Keras symbolic inputs/outputs do not implement `op`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.\n'
shape:TensorShape([])
type_spec:TensorSpec(shape=(), dtype=tf.int32, name=None)
_inferred_value:[None]
_keras_history:KerasHistory(layer=<tensorflow.python.keras.layers.core.SlicingOpLambda object at 0x1774fac88>, node_index=0, tensor_index=0)
_name:'tf.__operators__.getitem_6/strided_slice:0'
_overload_all_operators:<bound method KerasTensor._overload_all_operators of <class 'tensorflow.python.keras.engine.keras_tensor.KerasTensor'>>
_overload_operator:<bound method KerasTensor._overload_operator of <class 'tensorflow.python.keras.engine.keras_tensor.KerasTensor'>>
_to_placeholder:<bound method KerasTensor._to_placeholder of <KerasTensor: shape=() dtype=int32 inferred_value=[None] (created by layer 'tf.__operators__.getitem_6')>>
_type_spec:TensorSpec(shape=(), dtype=tf.int32, name=None)
When the error is raised, K.less(idx_outer, timesteps) can be evaluated successfully:
K.less(idx_outer, timesteps) == <KerasTensor: shape=() dtype=bool (created by layer 'tf.math.less')>
So I believe the error comes from tf.concat, and I'm now trying to replace tf.concat with another operation (e.g., a Keras Concatenate layer).
Simpler Example
The following code works when end = tf.constant(7), but raises
Keras symbolic inputs/outputs do not implement `__len__`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model. This error will also get raised if you try asserting a symbolic input/output directly.
at _, final_output = tf.while_loop(cond, body, loop_vars=[step, output]) when end = Input(shape=(), batch_size=1, name="timesteps", dtype=tf.int32).
import tensorflow as tf
from keras.layers import Input

# end = Input(shape=(), batch_size=1, name="timesteps", dtype=tf.int32)  ## does not work :(
end = tf.constant(7)  ## works :)
array = tf.Variable([1., 1., 1., 1., 1., 1., 1.])
step = tf.constant(0)
output = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)

def cond(step, output):
    return step < end

def body(step, output):
    output = output.write(step, tf.gather(array, step))
    return step + 1, output

_, final_output = tf.while_loop(cond, body, loop_vars=[step, output])
Try wrapping your logic in a custom layer and using tf operations:
import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(CustomLayer, self).__init__()

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        end = input_shape[-1]
        array = tf.ones((input_shape[-1],))
        step = tf.constant(0)
        output = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)

        def cond(step, output):
            return step < end

        def body(step, output):
            output = output.write(step, tf.gather(array, step))
            return step + 1, output

        _, final_output = tf.while_loop(cond, body, loop_vars=[step, output])
        return tf.reshape(final_output.stack(), (input_shape))

inputs = tf.keras.layers.Input(shape=(None,), batch_size=1, name="timesteps", dtype=tf.int32)
cl = CustomLayer()
outputs = cl(inputs)
model = tf.keras.Model(inputs, outputs)
random_data = tf.random.uniform((1, 7), dtype=tf.int32, maxval=50)
print(model(random_data))
tf.Tensor([1. 1. 1. 1. 1. 1. 1.], shape=(7,), dtype=float32)
timesteps_mixed_outer = tf.concat([timesteps_mixed_outer, [timesteps_mixed_inner]], axis=0)
You have to check the shapes of timesteps_mixed_outer and timesteps_mixed_inner, and try changing the axis value.
Or try this:
timesteps_mixed_outer = tf.concat([timesteps_mixed_outer.numpy(), timesteps_mixed_inner.numpy()], axis=0)

Tensorflow aggregates scalar-tensor multiplication gradient

Suppose I multiply a vector with a scalar, e.g.:
a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])
with tf.GradientTape() as tape:
    c = a * b
grad = tape.gradient(c, a)
The resulting gradient I get is a scalar,
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
whereas we would expect the vector:
<tf.Variable 'Variable:0' shape=(3,) dtype=float32, numpy=array([1., 0., 1.], dtype=float32)>
Looking at other examples, it appears that TensorFlow sums the expected vector, also for scalar-matrix multiplication and so on.
Why does TensorFlow do this? This can probably be avoided using @custom_gradient; is there another, less cumbersome way to get the correct gradient?
There appear to be some related questions, but these all seem to consider the gradient of a loss function that aggregates over a training batch. No loss function or aggregation is used here, so I think the issue is something else.
You're getting a scalar value because you took the gradient with respect to a scalar: tape.gradient sums the partial derivatives of every element of the target, so here it returns sum(b) = 1 + 0 + 1 = 2. You would get a vector if you took the gradient with respect to a vector. Take a look at the following example:
import tensorflow as tf

a = tf.Variable(3., trainable=True)
b = tf.Variable([1., 0, 1.], trainable=True)
c = tf.Variable(2., trainable=True)
d = tf.Variable([2., 1, 2.], trainable=True)

with tf.GradientTape(persistent=True) as tape:
    e = a*b*c*d  # abcd, abcd, abcd
    tf.print(e)

grad = tape.gradient(e, [a, b, c, d])
grad[0].numpy(), grad[1].numpy(), grad[2].numpy(), grad[3].numpy()
[12 0 12]
(8.0,
 array([12., 6., 12.], dtype=float32),
 12.0,
 array([6., 0., 6.], dtype=float32))
Formally, what I was looking for was the differential of the vector field that is a function of the variable a. For a vector field the differential is the same as the Jacobian. It turns out that what I was looking for can be done with tape.jacobian.
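For reference, a minimal sketch of this on the original example (tape.jacobian keeps one partial derivative per element of c instead of summing them):
import tensorflow as tf

a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])
with tf.GradientTape() as tape:
    c = a * b
jac = tape.jacobian(c, a)  # dc_i/da = b_i
print(jac)
# tf.Tensor([1. 0. 1.], shape=(3,), dtype=float32)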

Keras: How to get weights, make them a 1D array, then restore them to their initial shape?

I want to do an experiment, and I need to get the Keras model weights, flatten them into a 1D array, and then restore them to their initial shape.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
model = tf.keras.Sequential()
# Add a densely-connected layer with 4 units to the model:
model.add(layers.Dense(4, input_dim=5, activation='relu'))
# Add another:
model.add(layers.Dense(3, activation='relu'))
# Add an output layer with 2 output units:
model.add(layers.Dense(2))

# Configure a model for mean-squared error regression.
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss='mse',       # mean squared error
              metrics=['mae'])  # mean absolute error

weights = model.get_weights()
# make the weights a 1D array
# make the 1D array take the initial shape again
model.set_weights(weights)
Why do I want to do this? Because I want to do some mutation using another module, and that module requires a 1D array.
How can I do this? As we know, Keras model weights look like this:
[array([[-0.24053234,  0.4722855 ,  0.29863954,  0.22805429],
        [ 0.45101106, -0.00229341, -0.6142864 , -0.2751704 ],
        [ 0.159172  ,  0.43983865,  0.61577237,  0.24255097],
        [ 0.24160242,  0.422235  ,  0.8066592 , -0.2711717 ],
        [-0.30763668, -0.4841219 ,  0.767977  ,  0.23558974]],
       dtype=float32),
 array([0., 0., 0., 0.], dtype=float32),
 array([[ 0.24129152, -0.4890638 ,  0.18787515],
        [ 0.8663894 , -0.09163451, -0.86416066],
        [-0.01754427,  0.32654428, -0.78837514],
        [ 0.589849  ,  0.5886531 ,  0.27824092]], dtype=float32),
 array([0., 0., 0.], dtype=float32),
 array([[ 0.8456359 , -0.26292562],
        [-1.0447757 , -0.43539298],
        [ 1.0835328 , -0.43536085]], dtype=float32),
 array([0., 0.], dtype=float32)]
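Since no answer is shown here, a minimal NumPy sketch of one way to do it, assuming the model defined above: remember each array's shape, flatten everything into one vector, mutate it, then rebuild the original list of arrays.
import numpy as np

weights = model.get_weights()
shapes = [w.shape for w in weights]                  # remember each array's shape
flat = np.concatenate([w.ravel() for w in weights])  # single 1D array of all parameters

# ... mutate `flat` with the external module here ...

restored, idx = [], 0
for shape in shapes:
    size = int(np.prod(shape))                       # number of elements in this array
    restored.append(flat[idx:idx + size].reshape(shape))
    idx += size
model.set_weights(restored)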

Stop gradients flowing in a joint loss

I have an input tensor
data = tf.placeholder(tf.int32, [None])
which will be embedded by
embedding_matrix = tf.get_variable("embedding_matrix", [5,3], tf.float32, initializer=tf.random_normal_initializer())
input_vectors = tf.nn.embedding_lookup(params=embedding_matrix, ids=data)
I perform a linear transformation on the input vector using output1_weights to get network_output1
output1_weights = tf.get_variable("output1", [3,4], tf.float32, initializer=tf.random_normal_initializer())
network_output1 = tf.matmul(input_vectors, output1_weights)
The loss will be very standard stuff
loss1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output1, logits=network_output1)
Now I want to use the logits network_output1 as input to compute another linear transformation
output2_weights = tf.get_variable("output2", [4,5], tf.float32, initializer=tf.random_normal_initializer())
network_output2 = tf.matmul(network_output1, output2_weights)
Again cross-entropy loss on the second output
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output2, logits=network_output2)
Here is what I want to achieve: in a joint loss setting, I want to back-prop only the gradient of output1_weights when minimizing loss1, and only the gradient of output2_weights when minimizing loss2. In other words, when optimizing loss2 I don't want the gradients to flow all the way back and tamper with output1_weights. I am aware of the compute_gradients function in the optimizer class, which can take an argument var_list, but it seems it cannot stop the gradients from flowing for separate losses. I could also consider separating the losses and minimizing them individually, but that would be a bad solution in my setting.
All you have to do is select the trainable variables and assign them to var_list.
First, count the trainable variables of your different losses.
import numpy as np
import tensorflow as tf
data = tf.placeholder(tf.int32, [None])
output1 = tf.placeholder(tf.int32, [None])
output2 = tf.placeholder(tf.int32, [None])
embedding_matrix = tf.get_variable("embedding_matrix", [5,3], tf.float32, initializer=tf.random_normal_initializer())
input_vectors = tf.nn.embedding_lookup(params=embedding_matrix, ids=data)
# count
params_num0 = len(tf.trainable_variables())
output1_weights = tf.get_variable("output1", [3,4], tf.float32, initializer=tf.random_normal_initializer())
network_output1 = tf.matmul(input_vectors, output1_weights)
# count
params_num1 = len(tf.trainable_variables())
loss1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output1, logits=network_output1)
output2_weights = tf.get_variable("output2", [4,5], tf.float32, initializer=tf.random_normal_initializer())
network_output2 = tf.matmul(network_output1, output2_weights)
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output2, logits=network_output2)
Then print them and all trainable variables.
params = tf.trainable_variables()
print(params_num0)
print(params_num1)
print(params)
# 1
# 2
# [<tf.Variable 'embedding_matrix:0' shape=(5, 3) dtype=float32_ref>, <tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>, <tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>]
You can see that there are three trainable variables: loss1 should update the second (output1_weights), and loss2 the third (output2_weights).
# if you want to back-prop the gradient of embedding_matrix,
# params1 = params[:params_num1]
# params2 = params[:params_num0] + params[params_num1:]
params1 = params[params_num0:params_num1]
params2 = params[params_num1:]
print(params1)
print(params2)
# [<tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>]
# [<tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>]
Next, specify the gradients to compute for the corresponding variables.
opt = tf.train.AdamOptimizer(0.01)
grads_vars = opt.compute_gradients(loss1,var_list=params1)
grads_vars2 = opt.compute_gradients(loss2,var_list=params2)
print(grads_vars)
print(grads_vars2)
# [(<tf.Tensor 'gradients/MatMul_grad/tuple/control_dependency_1:0' shape=(3, 4) dtype=float32>, <tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>)]
# [(<tf.Tensor 'gradients_1/MatMul_1_grad/tuple/control_dependency_1:0' shape=(4, 5) dtype=float32>, <tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>)]
Finally, we can use apply_gradients() to update the trainable variables.
train_op = opt.apply_gradients(grads_vars+grads_vars2)
Experiment
data_np = np.random.normal(size=(100))
output1_np = np.random.randint(0, 4, size=(100))
output2_np = np.random.randint(0, 5, size=(100))
feed_dict_v = {data: data_np, output1: output1_np, output2: output2_np}

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        print("epoch:{}".format(i))
        sess.run(train_op, feed_dict=feed_dict_v)
        print("embedding_matrix value:\n", sess.run(embedding_matrix, feed_dict=feed_dict_v))
        print("output1_weights value:\n", sess.run(output1_weights, feed_dict=feed_dict_v))
        print("output2_weights value:\n", sess.run(output2_weights, feed_dict=feed_dict_v))
The result:
epoch:0
embedding_matrix value:
 [[ 0.7646786  -0.44221798 -1.6374763 ]
  [-0.4061512  -0.70626575  0.09637168]
  [ 1.3499098   0.38479885 -0.10424987]
  [-1.3999717   0.67008936  1.8843309 ]
  [-0.11357951 -1.1893668   1.1205566 ]]
output1_weights value:
 [[-0.22709225  0.70598644  0.10429419 -2.2737694 ]
  [-0.6364337  -0.08602498  1.9750406   0.8664075 ]
  [ 0.3656631  -0.25182125 -0.14689662 -0.03764082]]
output2_weights value:
 [[ 0.00554644 -0.49370843 -0.75148153  0.6645286   1.0131303 ]
  [ 0.21612553  0.07851358  0.05937392 -0.3236267  -0.8081816 ]
  [ 0.82237226  0.17242427 -1.3059226  -1.1134574   0.22402465]
  [-1.6996336  -0.58993673 -0.7071007   0.8407903   0.62416744]]
epoch:1
embedding_matrix value:
 [[ 0.7646786  -0.44221798 -1.6374763 ]
  [-0.4061512  -0.70626575  0.09637168]
  [ 1.3499098   0.38479885 -0.10424987]
  [-1.3999717   0.67008936  1.8843309 ]
  [-0.11357951 -1.1893668   1.1205566 ]]
output1_weights value:
 [[-0.21710345  0.6959941   0.11408082 -2.2637703 ]
  [-0.64639646 -0.07603455  1.9650643   0.85640883]
  [ 0.35567763 -0.24182947 -0.15682784 -0.04763966]]
output2_weights value:
 [[ 0.01553426 -0.5036415  -0.7415529   0.65454334  1.003145  ]
  [ 0.20613036  0.08847766  0.04942677 -0.31363514 -0.7981894 ]
  [ 0.8323502   0.16245098 -1.2959852  -1.1234138   0.21408063]
  [-1.6896346  -0.59990865 -0.6971453   0.8307945   0.6141711 ]]
You can see that embedding_matrix never changed; output1_weights and output2_weights were only updated by their corresponding gradients.
Add
In fact, you can combine loss1 and loss2 on output2_weights. For example:
grads_vars3 = opt.compute_gradients(loss1+loss2,var_list=params2)
You will find that grads_vars2 and grads_vars3 are equal when loss1 and loss2 are combined by addition. The reason is that the gradient of loss1 does not flow to output2_weights in loss1 + loss2. But in the following case, where loss1 and loss2 are combined by multiplication, grads_vars2 and grads_vars3 are not equal:
grads_vars3 = opt.compute_gradients(loss1*loss2,var_list=params2)
The above cases show that we can combine losses for the corresponding trainable variables according to our needs.
In your scenario, network_output2 needs to use network_output1, so we have to specify a var_list per loss. If network_output2 did not depend on network_output1, we could directly optimize loss1 + loss2.
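As an aside, a hypothetical alternative sketch (not part of the answer above): tf.stop_gradient can cut the graph so that loss2 treats network_output1 as a constant. Note this differs from the var_list approach in that minimize(loss1 + loss2) would then still update embedding_matrix through loss1:
# loss2 sees network_output1 as a constant, so its gradient
# cannot flow back into output1_weights or embedding_matrix
network_output1_frozen = tf.stop_gradient(network_output1)
network_output2 = tf.matmul(network_output1_frozen, output2_weights)
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output2, logits=network_output2)
train_op = tf.train.AdamOptimizer(0.01).minimize(loss1 + loss2)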
About gradients
input = tf.constant([[1, 2, 3]], tf.float32)
label1 = tf.constant([[1, 2, 3, 4]], tf.float32)
label2 = tf.constant([[1, 2, 3, 4, 5]], tf.float32)

weight1 = tf.reshape(tf.range(12, dtype=tf.float32), [3, 4])
output1 = tf.matmul(input, weight1)
loss1 = tf.reduce_sum(output1 - label1)

weight2 = tf.reshape(tf.range(20, dtype=tf.float32), [4, 5])
output2 = tf.matmul(output1, weight2)
loss2 = tf.reduce_sum(output2 - label2)

grad1 = tf.gradients(loss1, weight1)
grad2 = tf.gradients(loss2, weight2)
grad3 = tf.gradients(loss1 + loss2, weight2)

with tf.Session() as sess:
    print(sess.run(grad1))
    print(sess.run(grad2))
    print(sess.run(grad3))
# [array([[1., 1., 1., 1.],
#         [2., 2., 2., 2.],
#         [3., 3., 3., 3.]], dtype=float32)]
# [array([[32., 32., 32., 32., 32.],
#         [38., 38., 38., 38., 38.],
#         [44., 44., 44., 44., 44.],
#         [50., 50., 50., 50., 50.]], dtype=float32)]
# [array([[32., 32., 32., 32., 32.],
#         [38., 38., 38., 38., 38.],
#         [44., 44., 44., 44., 44.],
#         [50., 50., 50., 50., 50.]], dtype=float32)]

Getting Keras / Tensorflow to output OneHotCategorical, but operation has None for gradient

Problem description
I have inputs x that are indicator variables, and outputs y, where each row is a random one-hot vector that depends on the values of x (data sample shown below).
I want to train a model that essentially learns the probabilistic relationship between x and y in the form of per-column weights. The model must "choose" one, and only one, indicator to output. My current approach is to sample a categorical random variable and produce a one-hot vector as a prediction.
The issue is that I'm getting an error ValueError: An operation has `None` for gradient when I try to train my Keras model.
I find this error odd, because I've trained mixture networks using Keras and Tensorflow, which use tf.contrib.distributions.Categorical, and I did not run into any gradient-related issues.
Code
Experiment
import tensorflow as tf
import tensorflow.contrib.distributions as tfd
import numpy as np
from keras import backend as K
from keras.layers import Layer
from keras.models import Sequential
from keras.utils import to_categorical

def make_xy_prob(rng, size=10000):
    rng = np.random.RandomState(rng) if isinstance(rng, int) else rng
    cols = 3
    weights = np.array([[1, 2, 3]])
    # generate data and drop zeros for now
    x = rng.choice(2, (size, cols))
    is_zeros = x.sum(axis=1) == 0
    x = x[~is_zeros]
    # use weights to create probabilities for determining y
    weighted_x = x * weights
    prob_x = weighted_x / weighted_x.sum(axis=1, keepdims=True)
    y = np.row_stack([to_categorical(rng.choice(cols, p=p), cols) for p in prob_x])
    # add zeros back and shuffle
    zeros = np.zeros(((size - len(x), cols)))
    x = np.row_stack([x, zeros])
    y = np.row_stack([y, zeros])
    shuffle_idx = rng.permutation(size)
    x = x[shuffle_idx]
    y = y[shuffle_idx]
    return x, y

class OneHotGate(Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel', shape=(1, input_shape[1]), initializer='ones')

    def call(self, x):
        zero_cond = x < 1
        x_shape = tf.shape(x)
        # weight indicators so that more probability is assigned to more likely columns
        weighted_x = x * self.kernel
        # fill zeros with -inf so that zero probability is assigned to that column
        ninf_fill = tf.fill(x_shape, -np.inf)
        masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
        onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))
        # fill gate with zeros where input was originally zero
        zeros_fill = tf.fill(x_shape, 0.0)
        masked_gate = tf.where(zero_cond, zeros_fill, onehot_gate)
        return masked_gate

def experiment(epochs=10):
    K.clear_session()
    rng = np.random.RandomState(2)
    X, y = make_xy_prob(rng)
    input_shape = (X.shape[1], )
    model = Sequential()
    gate_layer = OneHotGate(input_shape=input_shape)
    model.add(gate_layer)
    model.compile('adam', 'categorical_crossentropy')
    model.fit(X, y, 64, epochs, verbose=1)
Data sample
>>> x
array([[1., 1., 1.],
       [0., 1., 0.],
       [1., 0., 1.],
       ...,
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 0.]])
>>> y
array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])
Error
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
The problem lies in the fact that OneHotCategorical performs discontinuous sampling, which causes the gradient computation to fail. In order to replace this discontinuous sampling with a continuous (relaxed) version, you may try RelaxedOneHotCategorical (which is based on the interesting Gumbel-Softmax technique).
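A minimal sketch of the swap inside the call method, using the same tfd import as the question (the temperature value is an assumption you would need to tune; lower values give samples closer to discrete one-hot):
# before (discrete, non-differentiable sampling):
# onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))

# after (relaxed Gumbel-Softmax sampling, differentiable w.r.t. the logits):
temperature = 0.5  # assumed hyperparameter
onehot_gate = tf.squeeze(tfd.RelaxedOneHotCategorical(temperature, logits=masked_x).sample(1))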
