I'm trying to create an N x N tensor using tf.while_loop in my custom Keras layer. Here, N (timesteps in the code) is a Keras symbolic tensor (an integer scalar). The code below is from the __call__ method of my custom Keras layer in a Functional model.
import tensorflow as tf
from keras import backend as K

# timesteps = tf.constant(7)  ## This makes this code work!!
timesteps = K.shape(inputs)[1]  ## Or equivalently provided by timesteps = keras.layers.Input(shape=(), batch_size=1, name="timesteps")
# timesteps = tf.convert_to_tensor(timesteps)  ## Does not work.
idx_outer = tf.constant(0)
timesteps_mixed_outer = tf.reshape(tf.Variable([]), (0, timesteps))
# timesteps_mixed_outer = Lambda(lambda timesteps: tf.reshape(tf.Variable([]), (0, timesteps)))(timesteps)  ## Does not work

def body_inner(idx_inner, idx_outer, timesteps_mixed_inner):
    timesteps_mixed_inner = tf.concat([timesteps_mixed_inner, [tf.cond(idx_inner == idx_outer, lambda: True, lambda: False)]], axis=0)
    return idx_inner + 1, idx_outer, timesteps_mixed_inner

def body_outer(idx_outer, timesteps_mixed_outer):
    timesteps_mixed_inner = tf.Variable([])
    idx_inner = tf.constant(0)
    idx_inner, idx_outer, timesteps_mixed_inner = tf.while_loop(
        lambda idx_inner, idx_outer, timesteps_mixed_inner: K.less(idx_inner, timesteps),
        body_inner,
        [idx_inner, idx_outer, timesteps_mixed_inner],
        shape_invariants=[idx_inner.get_shape(), idx_outer.get_shape(), tf.TensorShape([None])])
    timesteps_mixed_outer = tf.concat([timesteps_mixed_outer, [timesteps_mixed_inner]], axis=0)
    return idx_outer + 1, timesteps_mixed_outer

idx_outer, timesteps_mixed_outer = tf.while_loop(
    lambda idx_outer, timesteps_mixed_outer: K.less(idx_outer, timesteps),
    body_outer,
    [idx_outer, timesteps_mixed_outer],
    shape_invariants=[idx_outer.get_shape(), tf.TensorShape([None, None])])  ## Here raises error
The last statement of the code above (the outer tf.while_loop call) raises the following error:
Exception has occurred: TypeError
Could not build a TypeSpec for <KerasTensor: shape=(0, None) dtype=float32 (created by layer 'tf.reshape')> with type KerasTensor
What I have tried:
I suspected the problem came from the Keras symbolic tensor input 'timesteps', so I changed it to timesteps = tf.constant(7) as an experiment. Then the code works and 'timesteps_mixed_outer' has the desired values:
<tf.Tensor: shape=(7, 7), dtype=float32, numpy=
array([[1., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1.]], dtype=float32)>
I then suspected the problem came from using the Keras symbolic tensor timesteps in tf.reshape, so I initialized timesteps_mixed_outer = tf.reshape(tf.Variable([]), (0, 7)) and left timesteps = K.shape(inputs)[1]. Then a new error occurs:
Exception has occurred: TypeError
Keras symbolic inputs/outputs do not implement `__len__`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model. This error will also get raised if you try asserting a symbolic input/output directly.
I have also tried wrapping tf.reshape following the two solutions suggested in "TypeError: Could not build a TypeSpec for <KerasTensor when using tf.map_fn and keras functional model", but both raise the same error.
My environment is as follows:
MacOS 12.0.1
Python 3.7.3
keras-preprocessing [installed: 1.1.2]
keras.__version__ == 2.4.3
tensorflow [installed: 2.4.1]
tensorflow-estimator [installed: 2.4.0]
This error is raised when I build the Keras model, before feeding actual NumPy values. timesteps = K.shape(inputs)[1] varies across inputs, so it is set to None, like a batch dimension.
timesteps = K.shape(inputs)[1]
<KerasTensor: shape=() dtype=int32 inferred_value=[None] (created by layer 'tf.__operators__.getitem_6')>
op:'Traceback (most recent call last):\n File "/Users/imgspoints/.vscode/extensions/ms-python.python-2022.2.1924087327/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/", line 193, in _get_py_dictionary\n attr = getattr(var, name)\n File "/Users/imgspoints/.local/share/virtualenvs/experiments-m6CLaaa4/lib/python3.7/site-packages/tensorflow/python/keras/engine/", line 251, in op\n raise TypeError(\'Keras symbolic inputs/outputs do not \'\nTypeError: Keras symbolic inputs/outputs do not implement `op`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.\n'
type_spec:TensorSpec(shape=(), dtype=tf.int32, name=None)
_keras_history:KerasHistory(layer=<tensorflow.python.keras.layers.core.SlicingOpLambda object at 0x1774fac88>, node_index=0, tensor_index=0)
_overload_all_operators:<bound method KerasTensor._overload_all_operators of <class 'tensorflow.python.keras.engine.keras_tensor.KerasTensor'>>
_overload_operator:<bound method KerasTensor._overload_operator of <class 'tensorflow.python.keras.engine.keras_tensor.KerasTensor'>>
_to_placeholder:<bound method KerasTensor._to_placeholder of <KerasTensor: shape=() dtype=int32 inferred_value=[None] (created by layer 'tf.__operators__.getitem_6')>>
_type_spec:TensorSpec(shape=(), dtype=tf.int32, name=None)
When the error is raised, K.less(idx_outer, timesteps) can be evaluated successfully:
timesteps == <KerasTensor: shape=() dtype=bool (created by layer 'tf.math.less')>
So I believe the error comes from tf.concat, and I'm now trying to replace tf.concat with another operation (e.g. the Keras Concatenate layer).
Simpler Example
The following code works when end = tf.constant(7) but raises
Keras symbolic inputs/outputs do not implement `__len__`. You may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model. This error will also get raised if you try asserting a symbolic input/output directly.
error at _, final_output = tf.while_loop(cond, body, loop_vars=[step, output]) when end = Input(shape= (), batch_size= 1, name= "timesteps", dtype= tf.int32).
import tensorflow as tf
from keras.layers import Input

# end = Input(shape=(), batch_size=1, name="timesteps", dtype=tf.int32)  ## does not work :(
end = tf.constant(7)  ## works :)
array = tf.Variable([1., 1., 1., 1., 1., 1., 1.])
step = tf.constant(0)
output = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)

def cond(step, output):
    return step < end

def body(step, output):
    output = output.write(step, tf.gather(array, step))
    return step + 1, output

_, final_output = tf.while_loop(cond, body, loop_vars=[step, output])
Try wrapping your logic in a custom layer and using tf operations:
import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(CustomLayer, self).__init__()

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        end = input_shape[-1]
        array = tf.ones((input_shape[-1],))
        step = tf.constant(0)
        output = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)

        def cond(step, output):
            return step < end

        def body(step, output):
            output = output.write(step, tf.gather(array, step))
            return step + 1, output

        _, final_output = tf.while_loop(cond, body, loop_vars=[step, output])
        return tf.reshape(final_output.stack(), (input_shape))

inputs = tf.keras.layers.Input(shape=(None,), batch_size=1, name="timesteps", dtype=tf.int32)
cl = CustomLayer()
outputs = cl(inputs)
model = tf.keras.Model(inputs, outputs)
random_data = tf.random.uniform((1, 7), dtype=tf.int32, maxval=50)
tf.Tensor([1. 1. 1. 1. 1. 1. 1.], shape=(7,), dtype=float32)
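For the original N x N case, the target printed in the question is an identity matrix, so the nested loops may not be needed at all. The following is a minimal sketch under that assumption: `IdentityMask` is a hypothetical layer name, and `tf.eye` is swapped in for the `tf.while_loop` approach rather than repairing it.

```python
import tensorflow as tf

class IdentityMask(tf.keras.layers.Layer):
    """Hypothetical layer: returns a (timesteps, timesteps) identity matrix."""

    def call(self, inputs):
        # tf.shape is resolved at run time, so a None time dimension is fine.
        timesteps = tf.shape(inputs)[1]
        return tf.eye(timesteps, dtype=tf.float32)

# Functional model with a variable-length time axis (feature size 3 is arbitrary).
inputs = tf.keras.layers.Input(shape=(None, 3))
outputs = IdentityMask()(inputs)
model = tf.keras.Model(inputs, outputs)

# Call the model on a batch with 7 timesteps.
mask = model(tf.zeros((2, 7, 3)))
print(mask.shape)
```

Because the dynamic shape is read inside `call`, the symbolic KerasTensor never reaches a raw TF API during model construction, which is the same idea as the custom layer above.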
timesteps_mixed_outer = tf.concat([timesteps_mixed_outer, [timesteps_mixed_inner]], axis = 0)
You have to check the shapes of timesteps_mixed_outer and timesteps_mixed_inner. Try changing the axis value.
Or try this:
timesteps_mixed_outer = tf.concat([timesteps_mixed_outer.numpy(), timesteps_mixed_inner.numpy()], axis = 0)
Suppose I multiply a vector with a scalar, e.g.:
a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])
with tf.GradientTape() as tape:
    c = a*b
grad = tape.gradient(c, a)
The resulting gradient I get is a scalar,
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
whereas we would expect the vector:
<tf.Variable 'Variable:0' shape=(3,) dtype=float32, numpy=array([1., 0., 1.], dtype=float32)>
Looking at other examples, it appears that TensorFlow sums the expected vector; the same happens for scalar-matrix multiplication and so on.
Why does TensorFlow do this? This can probably be avoided using a custom gradient (tf.custom_gradient), but is there a less cumbersome way to get the correct gradient?
There appear to be some related questions, but these all seem to consider the gradient of a loss function that aggregates over a training batch. No loss function or aggregation is used here, so I think the issue is something else?
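You're getting a scalar value because you took the gradient with respect to a scalar: tape.gradient differentiates the sum of the target's elements, so the result always has the shape of the source. You would get a vector if you took the gradient with respect to a vector. Take a look at the following example: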
import tensorflow as tf

a = tf.Variable(3., trainable=True)
b = tf.Variable([1., 0, 1.], trainable=True)
c = tf.Variable(2., trainable=True)
d = tf.Variable([2., 1, 2.], trainable=True)
with tf.GradientTape(persistent=True) as tape:
    e = a*b*c*d  # abcd , abcd , abcd
grad = tape.gradient(e, [a, b, c, d])
grad[0].numpy(), grad[1].numpy(), grad[2].numpy(), grad[3].numpy()
# e.numpy() is [12., 0., 12.]
(8.0,
 array([12., 6., 12.], dtype=float32),
 12.0,
 array([6., 0., 6.], dtype=float32))
Formally, what I was looking for was the differential of the vector field that is a function of the variable a. For a vector field, the differential is the same as the Jacobian. It turns out that this can be computed with tape.jacobian.
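To make the difference concrete, here is a small sketch that reuses the variables from the question and compares `tape.gradient` with `tape.jacobian`. The tape is made persistent only so that both calls can be issued on it.

```python
import tensorflow as tf

a = tf.Variable(3.)
b = tf.Variable([1., 0., 1.])

# persistent=True lets us query the same tape twice.
with tf.GradientTape(persistent=True) as tape:
    c = a * b

# gradient() differentiates sum(c), so the per-element derivatives are added up.
grad = tape.gradient(c, a)   # scalar, equal to sum(b)
# jacobian() keeps one derivative per element of c.
jac = tape.jacobian(c, a)    # shape (3,), equal to b

print(grad.numpy(), jac.numpy())
```

Here `jac` has the shape of `c` followed by the shape of `a`, which for a scalar `a` is just the shape of `c`.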
I want to run an experiment in which I need to get the Keras model weights, flatten them into a 1D array, and then restore them to their initial shapes.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

model = tf.keras.Sequential()
# Add a densely-connected layer with 4 units to the model:
model.add(layers.Dense(4, input_dim=5, activation='relu'))
# Add another:
model.add(layers.Dense(3, activation='relu'))
# Add an output layer (this line is missing in the source; 2 units are
# inferred from the (3, 2) kernel printed below, the activation is unknown):
model.add(layers.Dense(2))
# Configure a model for mean-squared error regression.
model.compile(optimizer='adam',  # assumed; the optimizer argument is missing in the source
              loss='mse',        # mean squared error
              metrics=['mae'])   # mean absolute error
weights = model.get_weights()
# make the weights a 1D array
# make the 1D array like the initial shapes again
Why do I want to do this?
Because I want to apply mutations using another module, which requires a 1D array.
How can I do this?
As we know, the Keras model weights look like this:
[array([[-0.24053234, 0.4722855 , 0.29863954, 0.22805429],
[ 0.45101106, -0.00229341, -0.6142864 , -0.2751704 ],
[ 0.159172 , 0.43983865, 0.61577237, 0.24255097],
[ 0.24160242, 0.422235 , 0.8066592 , -0.2711717 ],
[-0.30763668, -0.4841219 , 0.767977 , 0.23558974]],
dtype=float32), array([0., 0., 0., 0.], dtype=float32), array([[ 0.24129152, -0.4890638 , 0.18787515],
[ 0.8663894 , -0.09163451, -0.86416066],
[-0.01754427, 0.32654428, -0.78837514],
[ 0.589849 , 0.5886531 , 0.27824092]], dtype=float32), array([0., 0., 0.], dtype=float32), array([[ 0.8456359 , -0.26292562],
[-1.0447757 , -0.43539298],
[ 1.0835328 , -0.43536085]], dtype=float32), array([0., 0.], dtype=float32)]
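One possible way to do the round trip, sketched with NumPy only. The helper names `flatten_weights` and `unflatten_weights` are hypothetical, and the random arrays merely stand in for `model.get_weights()` with the same shapes as the list printed above.

```python
import numpy as np

def flatten_weights(weights):
    """Concatenate a list of arrays into one 1D vector and remember the shapes."""
    shapes = [w.shape for w in weights]
    flat = np.concatenate([w.ravel() for w in weights])
    return flat, shapes

def unflatten_weights(flat, shapes):
    """Split a 1D vector back into arrays with the given shapes."""
    sizes = [int(np.prod(s)) for s in shapes]
    # np.split takes the cut positions: cumulative sizes without the last one.
    chunks = np.split(flat, np.cumsum(sizes)[:-1])
    return [chunk.reshape(s) for chunk, s in zip(chunks, shapes)]

# Stand-in for model.get_weights(): kernels and biases of the three Dense layers.
rng = np.random.default_rng(0)
weights = [rng.normal(size=s).astype(np.float32)
           for s in [(5, 4), (4,), (4, 3), (3,), (3, 2), (2,)]]

flat, shapes = flatten_weights(weights)
restored = unflatten_weights(flat, shapes)
print(flat.shape)
```

After mutating `flat`, the restored list should be usable with `model.set_weights(restored)`, since it has the same order and shapes as the list returned by `model.get_weights()`.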
I have an input tensor
data = tf.placeholder(tf.int32, [None])
which will be embedded by
embedding_matrix = tf.get_variable("embedding_matrix", [5,3], tf.float32, initializer=tf.random_normal_initializer())
input_vectors = tf.nn.embedding_lookup(params=embedding_matrix, ids=data)
I perform a linear transformation on the input vector using output1_weights to get network_output1
output1_weights = tf.get_variable("output1", [3,4], tf.float32, initializer=tf.random_normal_initializer())
network_output1 = tf.matmul(input_vectors, output1_weights)
The loss will be very standard stuff
loss1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output1, logits=network_output1)
Now I want to use the logits network_output1 as input to compute another linear transformation
output2_weights = tf.get_variable("output2", [4,5], tf.float32, initializer=tf.random_normal_initializer())
network_output2 = tf.matmul(network_output1, output2_weights)
Again cross-entropy loss on the second output
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output2, logits=network_output2)
Here is what I want to achieve. In a joint loss setting, I want to back-propagate only the gradient of output1_weights when minimizing loss1, and only the gradient of output2_weights when minimizing loss2. In other words, when optimizing loss2 I don't want the gradients to flow all the way back and tamper with output1_weights. I am aware of the compute_gradients function in the optimizer class, which takes a var_list argument, but it seems it cannot stop the gradients from flowing for separate losses. I could also separate the losses and minimize them individually, but that would also be a bad solution in my setting.
All you have to do is select the trainable variables for each loss and pass them as var_list.
First, count the trainable variables at each stage of building the graph.
import numpy as np
import tensorflow as tf
data = tf.placeholder(tf.int32, [None])
output1 = tf.placeholder(tf.int32, [None])
output2 = tf.placeholder(tf.int32, [None])
embedding_matrix = tf.get_variable("embedding_matrix", [5,3], tf.float32, initializer=tf.random_normal_initializer())
input_vectors = tf.nn.embedding_lookup(params=embedding_matrix, ids=data)
# count
params_num0 = len(tf.trainable_variables())
output1_weights = tf.get_variable("output1", [3,4], tf.float32, initializer=tf.random_normal_initializer())
network_output1 = tf.matmul(input_vectors, output1_weights)
# count
params_num1 = len(tf.trainable_variables())
loss1 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output1, logits=network_output1)
output2_weights = tf.get_variable("output2", [4,5], tf.float32, initializer=tf.random_normal_initializer())
network_output2 = tf.matmul(network_output1, output2_weights)
loss2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=output2, logits=network_output2)
Then print the counts and all trainable variables.
params = tf.trainable_variables()
print(params_num0)
print(params_num1)
print(params)
# 1
# 2
# [<tf.Variable 'embedding_matrix:0' shape=(5, 3) dtype=float32_ref>, <tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>, <tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>]
You can see that there are three trainable variables: loss1 should update the second (output1), and loss2 the third (output2).
# if you want to back-prop the gradient of embedding_matrix,
# params1 = params[:params_num1]
# params2 = params[:params_num0] + params[params_num1:]
params1 = params[params_num0:params_num1]
params2 = params[params_num1:]
print(params1)
print(params2)
# [<tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>]
# [<tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>]
Next, specify which loss provides the gradient for each variable.
opt = tf.train.AdamOptimizer(0.01)
grads_vars = opt.compute_gradients(loss1, var_list=params1)
grads_vars2 = opt.compute_gradients(loss2, var_list=params2)
print(grads_vars)
print(grads_vars2)
# [(<tf.Tensor 'gradients/MatMul_grad/tuple/control_dependency_1:0' shape=(3, 4) dtype=float32>, <tf.Variable 'output1:0' shape=(3, 4) dtype=float32_ref>)]
# [(<tf.Tensor 'gradients_1/MatMul_1_grad/tuple/control_dependency_1:0' shape=(4, 5) dtype=float32>, <tf.Variable 'output2:0' shape=(4, 5) dtype=float32_ref>)]
Finally, we can use apply_gradients() to update the trainable variables.
train_op = opt.apply_gradients(grads_vars+grads_vars2)
data_np = np.random.randint(0, 5, size=(100))  # integer ids in [0, 5) for the int32 placeholder and the 5-row embedding
output1_np = np.random.randint(0, 4, size=(100))
output2_np = np.random.randint(0, 5, size=(100))
feed_dict_v = {data: data_np, output1: output1_np, output2: output2_np}
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        print("epoch:{}".format(i))
        sess.run(train_op, feed_dict=feed_dict_v)
        print("embedding_matrix value:\n", sess.run(embedding_matrix, feed_dict=feed_dict_v))
        print("output1_weights value:\n", sess.run(output1_weights, feed_dict=feed_dict_v))
        print("output2_weights value:\n", sess.run(output2_weights, feed_dict=feed_dict_v))
The result:
embedding_matrix value:
[[ 0.7646786 -0.44221798 -1.6374763 ]
[-0.4061512 -0.70626575 0.09637168]
[ 1.3499098 0.38479885 -0.10424987]
[-1.3999717 0.67008936 1.8843309 ]
[-0.11357951 -1.1893668 1.1205566 ]]
output1_weights value:
[[-0.22709225 0.70598644 0.10429419 -2.2737694 ]
[-0.6364337 -0.08602498 1.9750406 0.8664075 ]
[ 0.3656631 -0.25182125 -0.14689662 -0.03764082]]
output2_weights value:
[[ 0.00554644 -0.49370843 -0.75148153 0.6645286 1.0131303 ]
[ 0.21612553 0.07851358 0.05937392 -0.3236267 -0.8081816 ]
[ 0.82237226 0.17242427 -1.3059226 -1.1134574 0.22402465]
[-1.6996336 -0.58993673 -0.7071007 0.8407903 0.62416744]]
embedding_matrix value:
[[ 0.7646786 -0.44221798 -1.6374763 ]
[-0.4061512 -0.70626575 0.09637168]
[ 1.3499098 0.38479885 -0.10424987]
[-1.3999717 0.67008936 1.8843309 ]
[-0.11357951 -1.1893668 1.1205566 ]]
output1_weights value:
[[-0.21710345 0.6959941 0.11408082 -2.2637703 ]
[-0.64639646 -0.07603455 1.9650643 0.85640883]
[ 0.35567763 -0.24182947 -0.15682784 -0.04763966]]
output2_weights value:
[[ 0.01553426 -0.5036415 -0.7415529 0.65454334 1.003145 ]
[ 0.20613036 0.08847766 0.04942677 -0.31363514 -0.7981894 ]
[ 0.8323502 0.16245098 -1.2959852 -1.1234138 0.21408063]
[-1.6896346 -0.59990865 -0.6971453 0.8307945 0.6141711 ]]
You can see that embedding_matrix never changes, while output1_weights and output2_weights are each updated only by the gradient of their own loss.
In fact, you can combine loss1 and loss2 on output2_weights. For example:
grads_vars3 = opt.compute_gradients(loss1+loss2,var_list=params2)
You will find that grads_vars2 and grads_vars3 are equal when loss1 and loss2 are combined by addition. The reason is that the gradient of loss1 does not flow to output2_weights in loss1+loss2. But in the following cases, grads_vars2 and grads_vars3 are not equal when loss1 and loss2 are combined by multiplication.
grads_vars3 = opt.compute_gradients(loss1*loss2,var_list=params2)
The above cases mean that we can combine losses for corresponding trainable variables according to our own needs.
In your scenario, network_output2 depends on network_output1, so we have to specify which loss updates which variables. If network_output2 did not depend on network_output1, we could directly optimize loss1 + loss2.
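As an alternative to selecting variables with var_list, the backward path itself can be cut with `tf.stop_gradient`. The sketch below is not the method used in this answer: it assumes TF 2 eager execution with `tf.GradientTape` instead of TF 1 sessions, and the random tensors `x`, `y1`, `y2` and the helper `xent` are stand-ins introduced only for illustration.

```python
import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal([8, 3])                            # stand-in for input_vectors
y1 = tf.random.uniform([8], maxval=4, dtype=tf.int32)   # labels for the first output
y2 = tf.random.uniform([8], maxval=5, dtype=tf.int32)   # labels for the second output
w1 = tf.Variable(tf.random.normal([3, 4]))              # plays the role of output1_weights
w2 = tf.Variable(tf.random.normal([4, 5]))              # plays the role of output2_weights

def xent(labels, logits):
    """Mean sparse softmax cross-entropy (hypothetical helper)."""
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

with tf.GradientTape(persistent=True) as tape:
    out1 = tf.matmul(x, w1)
    loss1 = xent(y1, out1)
    # Treat out1 as a constant for the second head: loss2 cannot reach w1.
    out2 = tf.matmul(tf.stop_gradient(out1), w2)
    loss2 = xent(y2, out2)
    joint = loss1 + loss2

g1_joint = tape.gradient(joint, w1)   # gradient of the joint loss w.r.t. w1
g1_only = tape.gradient(loss1, w1)    # gradient of loss1 alone w.r.t. w1
g2_to_w1 = tape.gradient(loss2, w1)   # None, because the path is blocked
```

With the path blocked, minimizing the joint loss updates `w1` from `loss1` only and `w2` from `loss2` only, without splitting the variables into separate lists.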
About gradients
input = tf.constant([[1,2,3]],tf.float32)
label1 = tf.constant([[1,2,3,4]],tf.float32)
label2 = tf.constant([[1,2,3,4,5]],tf.float32)
weight1 = tf.reshape(tf.range(12,dtype=tf.float32),[3,4])
output1 = tf.matmul(input , weight1)
loss1 = tf.reduce_sum(output1 - label1)
weight2 = tf.reshape(tf.range(20,dtype=tf.float32),[4,5])
output2 = tf.matmul(output1 , weight2)
loss2 = tf.reduce_sum(output2 - label2)
grad1 = tf.gradients(loss1,weight1)
grad2 = tf.gradients(loss2,weight2)
grad3 = tf.gradients(loss1+loss2,weight2)
with tf.Session() as sess:
    print(sess.run(grad1))
    print(sess.run(grad2))
    print(sess.run(grad3))
# [array([[1., 1., 1., 1.],
# [2., 2., 2., 2.],
# [3., 3., 3., 3.]], dtype=float32)]
# [array([[32., 32., 32., 32., 32.],
# [38., 38., 38., 38., 38.],
# [44., 44., 44., 44., 44.],
# [50., 50., 50., 50., 50.]], dtype=float32)]
# [array([[32., 32., 32., 32., 32.],
# [38., 38., 38., 38., 38.],
# [44., 44., 44., 44., 44.],
# [50., 50., 50., 50., 50.]], dtype=float32)]
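The equality of grad2 and grad3 can also be checked without TensorFlow. The following NumPy sketch rebuilds the same constants and estimates both gradients with central finite differences; `numeric_grad` is a hypothetical helper, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

x = np.array([[1., 2., 3.]])
label1 = np.array([[1., 2., 3., 4.]])
label2 = np.array([[1., 2., 3., 4., 5.]])
weight1 = np.arange(12, dtype=np.float64).reshape(3, 4)
weight2 = np.arange(20, dtype=np.float64).reshape(4, 5)

def loss1(w1):
    return np.sum(x @ w1 - label1)

def loss2(w1, w2):
    return np.sum((x @ w1) @ w2 - label2)

def numeric_grad(f, w, eps=1e-3):
    """Central finite differences of the scalar f with respect to each entry of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = eps
        grad[idx] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Gradient of loss2 alone, and of loss1 + loss2, both with respect to weight2.
grad2 = numeric_grad(lambda w2: loss2(weight1, w2), weight2)
grad3 = numeric_grad(lambda w2: loss1(weight1) + loss2(weight1, w2), weight2)
print(np.round(grad2[:, 0]))
```

loss1 does not depend on weight2, so it contributes nothing to the gradient of the sum, which is why the two estimates agree.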
Problem description
I have inputs x that are indicator variables, and outputs y, where each row is a random one-hot vector that depends on the values of x (data sample shown below).
I want to train a model that essentially learns the probabilistic relationship between x and y in the form of per-column weights. The model must "choose" one, and only one, indicator to output. My current approach is to sample a categorical random variable and produce a one-hot vector as a prediction.
The issue is that I'm getting an error ValueError: An operation has `None` for gradient when I try to train my Keras model.
I find this error odd, because I've trained mixture networks using Keras and Tensorflow, which use tf.contrib.distributions.Categorical, and I did not run into any gradient-related issues.
import tensorflow as tf
import tensorflow.contrib.distributions as tfd
import numpy as np
from keras import backend as K
from keras.layers import Layer
from keras.models import Sequential
from keras.utils import to_categorical

def make_xy_prob(rng, size=10000):
    rng = np.random.RandomState(rng) if isinstance(rng, int) else rng
    cols = 3
    weights = np.array([[1, 2, 3]])
    # generate data and drop zeros for now
    x = rng.choice(2, (size, cols))
    is_zeros = x.sum(axis=1) == 0
    x = x[~is_zeros]
    # use weights to create probabilities for determining y
    weighted_x = x * weights
    prob_x = weighted_x / weighted_x.sum(axis=1, keepdims=True)
    y = np.row_stack([to_categorical(rng.choice(cols, p=p), cols) for p in prob_x])
    # add zeros back and shuffle
    zeros = np.zeros(((size - len(x), cols)))
    x = np.row_stack([x, zeros])
    y = np.row_stack([y, zeros])
    shuffle_idx = rng.permutation(size)
    x = x[shuffle_idx]
    y = y[shuffle_idx]
    return x, y

class OneHotGate(Layer):
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel', shape=(1, input_shape[1]), initializer='ones')

    def call(self, x):
        zero_cond = x < 1
        x_shape = tf.shape(x)
        # weight indicators so that more probability is assigned to more likely columns
        weighted_x = x * self.kernel
        # fill zeros with -inf so that zero probability is assigned to that column
        ninf_fill = tf.fill(x_shape, -np.inf)
        masked_x = tf.where(zero_cond, ninf_fill, weighted_x)
        onehot_gate = tf.squeeze(tfd.OneHotCategorical(logits=masked_x, dtype=x.dtype).sample(1))
        # fill gate with zeros where input was originally zero
        zeros_fill = tf.fill(x_shape, 0.0)
        masked_gate = tf.where(zero_cond, zeros_fill, onehot_gate)
        return masked_gate

def experiment(epochs=10):
    rng = np.random.RandomState(2)
    X, y = make_xy_prob(rng)
    input_shape = (X.shape[1], )
    model = Sequential()
    gate_layer = OneHotGate(input_shape=input_shape)
    model.add(gate_layer)
    model.compile('adam', 'categorical_crossentropy')
    model.fit(X, y, 64, epochs, verbose=1)
Data sample
>>> x
array([[1., 1., 1.],
[0., 1., 0.],
[1., 0., 1.],
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 0.]])
>>> y
array([[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.]])
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
The problem is that OneHotCategorical performs discontinuous sampling, which causes the gradient computation to fail. To replace this discontinuous sampling with a continuous (relaxed) version, one may try RelaxedOneHotCategorical (which is based on the Gumbel-Softmax technique).
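To illustrate the idea behind the relaxation, here is a NumPy-only sketch of a Gumbel-Softmax sample. It is not the RelaxedOneHotCategorical API: the function name `gumbel_softmax_sample` and the temperature of 0.5 are assumptions made for this example.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    # Gumbel(0, 1) noise via the inverse CDF; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Second column masked with -inf, mirroring masked_x in the layer above.
logits = np.array([[1.0, -np.inf, 3.0]])
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
print(sample)
```

The sample is a smooth function of the logits for fixed noise, so gradients can pass through it. Lower temperatures push the sample closer to a one-hot vector, and masked columns stay at exactly zero.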