In my implementation of a GAN, the discriminator sometimes outputs a value like 2.05145e+07, so 1 - disc_output becomes 1 - 2.05145e+07 = -2.05145e+07 (a negative number), and therefore log(1 - disc_output) evaluates to NaN.
I am not the first one with this kind of problem. One solution is to only allow positive values inside the log, as done here.
Does anyone know a better solution to this? Maybe a different loss function?
Because the discriminator is supposed to return a probability, its output must lie between 0 and 1. Try applying a sigmoid (https://www.tensorflow.org/api_docs/python/tf/sigmoid) to the discriminator's raw output before using it.
Additionally, as others have done, I suggest using tf.log(tf.maximum(x, 1e-9)) to guard against numerical instability.
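Putting both suggestions together, a minimal sketch could look like the following; the names (real_logits, fake_logits) are illustrative, not from the original question:

import tensorflow as tf

def disc_loss(real_logits, fake_logits, eps=1e-9):
    # squash raw, unbounded scores into (0, 1) so they behave like probabilities
    p_real = tf.sigmoid(real_logits)
    p_fake = tf.sigmoid(fake_logits)
    # clamp the log argument away from zero to avoid NaN/Inf
    return -tf.reduce_mean(tf.math.log(tf.maximum(p_real, eps))
                           + tf.math.log(tf.maximum(1.0 - p_fake, eps)))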
There are standard techniques to avoid numerical instability in the log. Typically, what you care about is the loss (which is a function of the log), not the log value itself. For instance, with the logistic loss:
For brevity, let x = logits, z = labels. The logistic loss is
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= max(x, 0) - x * z + log(1 + exp(-abs(x)))
These tricks are already implemented in standard TensorFlow losses (like tf.losses.sigmoid_cross_entropy). Note that the naive solution of taking a max or a min inside the log is not a good one, since there are no meaningful gradients in the saturated region: for instance, d/dx[max(x, 0)] = 0 for x < 0, so no gradient flows there.
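As a small self-contained sketch of the built-in stable op (the values here are illustrative):

import tensorflow as tf

logits = tf.constant([2.05145e+07, -3.0, 0.5])  # raw scores, can be huge
labels = tf.constant([1.0, 0.0, 1.0])           # 1.0 = real, 0.0 = fake

# Computes max(x, 0) - x*z + log(1 + exp(-|x|)) internally, so it stays
# finite even for large-magnitude logits.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))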
TensorFlow has GAN support with tf.contrib.gan. These losses already implement all of the standard numerical stability tricks and avoid you having to reinvent the wheel.
tfgan = tf.contrib.gan
tfgan.losses.minimax_discriminator_loss(...)
See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan for more details.
Related
I am training 2 autoencoders with 2 separate input paths jointly and I would like to randomly set one of the input paths to zero.
I use Keras with the TensorFlow backend (functional API).
I am computing a joint loss (sum of two losses) for backpropagation.
A -> A' & B -> B'
loss => l2(A,A')+l2(B,B')
The networks taking A and B are connected in the latent space.
I would like to randomly set A or B to zero and compute the loss only on the remaining path: if input path A is set to zero, the loss should be computed using only the outputs of path B, and vice versa; e.g.:
0 -> A' & B -> B'
loss: l2(B,B')
How do I randomly set an input path to zero? How do I write a callback which does this?
Maybe try the following:
import random
def decision(probability):
    # returns True with the given probability
    return random.random() < probability
Define a method that makes a random decision based on a given probability, and make your loss calculation depend on this decision.
# On a randomly selected epoch, train on only one of the two paths;
# otherwise use the full joint loss.
if current_epoch == random.choice(epochs):
    keep_mask = tf.ones_like(A.input, dtype=tf.float32)
    throw_mask = tf.zeros_like(A.input, dtype=tf.float32)
    if decision(probability=0.5):
        total_loss = tf.reduce_sum(reconstruction_loss_a * keep_mask
                                   + reconstruction_loss_b * throw_mask)
    else:
        total_loss = tf.reduce_sum(reconstruction_loss_a * throw_mask
                                   + reconstruction_loss_b * keep_mask)
else:
    total_loss = tf.reduce_sum(reconstruction_loss_a + reconstruction_loss_b)
I assume that you do not want to set one of the paths to zero every time you update your model parameters, as there is then a risk that one or even both models will not be sufficiently trained. Also note that I use the input of A to create the zeros_like and ones_like tensors, as I assume that both inputs have the same shape; if this is not the case, it can easily be adjusted.
Depending on your goal, you may also consider replacing the input of A or B with a random tensor, e.g. tf.random.normal, based on a random decision. This injects noise into your model, which may be desirable, as your model would be forced to look into the latent space to try to reconstruct the original input. This means precisely that you still calculate your reconstruction loss with A.input and A.output, but in reality your model never received A.input, only the random tensor.
Note that this answer serves as a simple conceptual example. A working example with Tensorflow can be found here.
You can set an input to 0 simply:
A = A * random.choice([0, 1])
This code can be used inside a loss function.
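One caveat (my addition, not part of the original answer): random.choice runs in Python, so in graph mode the 0/1 outcome would be frozen when the loss is first traced. A sketch that draws the decision in-graph instead, assuming TF2 and illustrative tensor names:

import tensorflow as tf

def joint_loss(a, a_rec, b, b_rec):
    # draw the 0/1 decision in-graph so it re-samples on every call,
    # even inside a tf.function-compiled training step
    keep_a = tf.cast(tf.random.uniform([]) < 0.5, tf.float32)
    loss_a = tf.reduce_mean(tf.square(a - a_rec))
    loss_b = tf.reduce_mean(tf.square(b - b_rec))
    # train on path A's loss or path B's loss, never both
    return keep_a * loss_a + (1.0 - keep_a) * loss_b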
I want to calculate the L1 loss in a neural network. I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in this code.
Is this really how to calculate the L1 loss in a NN, or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)

factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simply doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. Just checking whether there is an existing function to do this.
If I understand correctly, you want to compute the L1 loss of your model (as you say in the beginning). However, I think you may have gotten confused by the discussion in the PyTorch forum.
From what I understand of the PyTorch forum thread and the code you posted, the author is trying to regularize the network weights with an L1 penalty, i.e. to encourage the weight values to stay in a sensible range (not too big, not too small). That is weight regularization, which is why it iterates over model.parameters(). (A related but distinct concept is weight normalization, which takes a value as input and produces a normalized value as output.)
Check this for weights normalization: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, the L1 loss is just a way to determine how two values differ from each other; the "loss" is just a measure of this difference. In the case of the L1 loss, this error is computed as the mean absolute error, loss = |x - y|, where x and y are the values to compare. So the loss computation takes two values as input and produces a single value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the above snippets are not equivalent. The first applies weight regularization, while the second computes a loss between predictions and targets. This would be the loss computation with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.L1Loss()
loss_value = loss(target_predicted, target)  # mean absolute error
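And if what you actually want is the L1 weight penalty from the forum snippet, a corrected version of that loop might look like this (a sketch; it assumes a model and a task loss already exist, and sums absolute weight values directly, since nn.L1Loss needs a target):

import torch

# `model` and the base task `loss` are assumed to exist already
factor = 0.0005
reg_loss = sum(param.abs().sum() for param in model.parameters())
loss = loss + factor * reg_loss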
I'm having trouble using tfp.layers.DistributionLambda; I'm a TF newbie trying hard to make the tensors flow. Can someone please provide some insight into how to set up the output distribution's parameters?
Context:
The TFP team wrote a tutorial, Regression with Probabilistic Layers in TensorFlow Probability, which sets up the following model:
# Build model.
model = tfk.Sequential([
    tf.keras.layers.Dense(1 + 1),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])
My problem:
It outputs a normal distribution using tfp.layers.DistributionLambda, but I'm unclear how tfd.Normal's parameters (mean/loc and standard deviation/scale) were set up, so I'm having trouble changing the Normal to a Gamma distribution. I tried the following, but it didn't work (the predicted distribution parameters are nan).
def dist_output_layer(t, softplus_scale=0.05):
    """Create distribution with variable mean and variance."""
    mean = t[..., :1]
    std_dev = 1e-3 + tf.math.softplus(softplus_scale * mean)
    alpha = (mean / std_dev) ** 2
    beta = alpha / mean
    return tfd.Gamma(concentration=alpha,
                     rate=beta)
# Build model.
model = tf.keras.Sequential([
    # "By using a deeper neural network and introducing nonlinear activation
    # functions, however, we can learn more complicated functional dependencies!"
    tf.keras.layers.Dense(20, activation="relu"),
    # two neurons here b/c of the output distribution's mean and std. deviation
    tf.keras.layers.Dense(1 + 1),
    tfp.layers.DistributionLambda(dist_output_layer),
])
Thanks a lot in advance.
There is a lot to say about the code snippet you pasted from Medium, to be honest.
I hope you will find my comments below somewhat useful, though.
# Build model.
model = tfk.Sequential([
    # The first layer is a Dense layer with 2 units, one for each of the
    # parameters that will be learnt (see next layer). Its implied output
    # shape is (batch_size, 2).
    # Note that this Dense layer has no activation function, as we want any
    # real value; it will be used to parameterize the Normal distribution
    # in the following layer.
    tf.keras.layers.Dense(1 + 1),
    # The following layer is a DistributionLambda that encapsulates a Normal
    # distribution. The DistributionLambda takes a function in its
    # constructor, and this function should take the output tensor from the
    # previous layer as its input (that is the Dense layer commented above).
    # The goal is to learn the 2 parameters of the distribution: loc (the
    # mean) and scale (the standard deviation). For this, a lambda construct
    # is used. The ellipsis you can see in the loc and scale arguments (that
    # is, the 3 dots) stands for the batch dimensions. Also note that scale
    # (the standard deviation) cannot be negative; the softplus function is
    # used to make sure the learnt scale parameter stays positive.
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])
Regarding the question about the .05 being added, it's a small offset to solve some gradient issues that can arise without it. Basically a prior saying we're confident that the real variability is NOT smaller than epsilon (here .05), so we're gonna make sure that the std dev is never smaller by just adding that.
See https://github.com/tensorflow/probability/issues/751
Money quote:
"If infinitesimal scales end up being a problem in practice on a given task, the fix we commonly use is a softplus-and-shift, e.g. scale = epsilon + tf.math.softplus(unconstrained_scale), where epsilon is some tiny value like 1e-5 that we are a priori confident is much smaller than the true scale."
EDIT: Actually what is added is 1e-3, for the reasons I described above. As for the multiplication by 0.05... it might again just be a scaling or gradient adjustment, or perhaps a way to make the scale parameter start out at a certain size.
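To connect this back to the original question about switching to a Gamma: one plausible fix (my sketch, not from the tutorial or the Medium post) is to give each of the two positive parameters, concentration and rate, its own slice of the Dense output, and apply the same softplus-and-shift to each, instead of deriving both from the mean slice alone:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# Both concentration and rate must be positive, so softplus-and-shift
# each unconstrained output separately.
def gamma_layer(t):
    return tfd.Gamma(
        concentration=1e-3 + tf.math.softplus(0.05 * t[..., :1]),
        rate=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1 + 1),  # one unit per distribution parameter
    tfp.layers.DistributionLambda(gamma_layer),
])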
I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming a Gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network with two outputs: the first output should approximate the deterministic part (the function), and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there were an example in Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code; none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions based on different random settings of which nodes are dropped out. Then you can simply aggregate the outcomes and interpret the spread of the results as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." International Conference on Machine Learning, 2016.
Since I've found nothing simple to implement, I wrote something myself that models this explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.python.ops import math_ops

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss function that has the values of the last axis in y_pred
    approximate the mean and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    mean = y_pred[..., 0::2]      # even positions: predicted means
    variance = y_pred[..., 1::2]  # odd positions: predicted variances
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension: the mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the label value, and a term that has the variance approximate the squared difference of the value from the predicted mean.
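For comparison (my addition, not the answerer's code): under the Gaussian assumption in the question, the textbook loss is the negative log-likelihood, which couples the two outputs instead of fitting the variance by least squares. A sketch with the same even/odd output layout:

import tensorflow as tf

def gaussian_nll(y_true, y_pred, eps=1e-6):
    mean = y_pred[..., 0::2]
    # predict log-variance so the variance stays positive by construction
    log_var = y_pred[..., 1::2]
    y_true = tf.cast(y_true, y_pred.dtype)
    # negative log-likelihood of y_true under N(mean, exp(log_var)),
    # up to an additive constant
    nll = 0.5 * (log_var + tf.square(y_true - mean) / (tf.exp(log_var) + eps))
    return tf.reduce_mean(nll, axis=-1)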
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance to the output layer using error propagation. Consequently, we obtain two outputs - the mean and the variance.
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my MSE loss between the network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, TensorFlow (and Keras) doesn't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
# penalty is the norm itself when it exceeds 1, and 0 otherwise
norm_loss = tf.where(y_norm > 1, y_norm, 0.0)
total_loss = mse + norm_loss
Look at the docs of tf.where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal to one, this part of the loss is simply 0 and no gradient is produced.
But this can be very hard to optimize: your predictions could oscillate around a norm of 1. It is also possible to add a factor, e.g. total_loss = mse + 1000 * norm_loss. Be very careful with this; it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target))
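As a self-contained sketch of the penalty approach (my code, not part of the answer above): the hinge-style tf.maximum variant below penalizes only the excess over 1, which avoids the jump at the threshold that the tf.where version has.

import tensorflow as tf

def mse_with_norm_penalty(y_true, y_pred, weight=1.0):
    # standard MSE term
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # penalize only the part of the prediction norm that exceeds 1
    excess = tf.maximum(tf.norm(y_pred) - 1.0, 0.0)
    return mse + weight * excess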