Ignore/Mask certain labels Softmax Cross-entropy

Ignore/Mask certain labels Softmax Cross-entropy - python

In trying implement the cross entropy loss with l2 regularization described here A Fast and Accurate Dependency Parser using Neural Networks, I get the error: ValueError: Cannot feed value of shape (48,) for Tensor u'Placeholder_2:0', which has shape '(48, 1) after getting a negative loss for a few steps in training. My loss is as follows:
self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.prediction, labels=self.train_labels))
+ 1e-8 * (tf.nn.l2_loss(w1)
+ tf.nn.l2_loss(w2))
and is a negative value at step 0. I believe my problem is that some labels are -1 and should be ignored as stated in the paper: "A slight
variation is that we compute the softmax probabilities
only among the feasible transitions in
practice". How would I go about ignoring these labels when calculating the loss?

Related

Measuring incertainty in Bayesian Neural Network

Hy everybody,
I'm beginning with tensorflow probability and I have some difficulties to interpret my Bayesian neural network outputs.
I'm working on a regression case, and started with the example provided by tensorflow notebook here: https://blog.tensorflow.org/2019/03/regression-with-probabilistic-layers-in.html?hl=fr
As I seek to know the uncertainty of my network predictions, I dived directly into example 4 with Aleatoric & Epistemic Uncertainty. You can find my code bellow:
def negative_loglikelihood(targets, estimated_distribution):
return -estimated_distribution.log_prob(targets)
def posterior_mean_field(kernel_size, bias_size, dtype=None):
n = kernel_size + bias_size #number of total paramaeters (Weights and Bias)
c = np.log(np.expm1(1.))
return tf.keras.Sequential([
tfp.layers.VariableLayer(2 * n, dtype=dtype, initializer=lambda shape, dtype: random_gaussian_initializer(shape, dtype), trainable=True),
tfp.layers.DistributionLambda(lambda t: tfd.Independent(
# The Normal distribution with location loc and scale parameters.
tfd.Normal(loc=t[..., :n],
scale=1e-5 +0.01*tf.nn.softplus(c + t[..., n:])),
reinterpreted_batch_ndims=1)),
])
def prior(kernel_size, bias_size, dtype=None):
n = kernel_size + bias_size
return tf.keras.Sequential([
tfp.layers.VariableLayer(n, dtype=dtype),
tfp.layers.DistributionLambda(lambda t: tfd.Independent(
tfd.Normal(loc=t, scale=1),
reinterpreted_batch_ndims=1)),
])
def build_model(param):
model = keras.Sequential()
for i in range(param["n_layers"] ):
name="n_units_l"+str(i)
num_hidden = param[name]
model.add(tfp.layers.DenseVariational(units=num_hidden, make_prior_fn=prior,make_posterior_fn=posterior_mean_field,kl_weight=1/len(X_train),activation="relu"))
model.add(tfp.layers.DenseVariational(units=2, make_prior_fn=prior,make_posterior_fn=posterior_mean_field,activation="relu",kl_weight=1/len(X_train)))
model.add(tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t[..., :1],scale=1e-3 + tf.math.softplus(0.01 * t[...,1:]))))
lr = param["learning_rate"]
optimizer=optimizers.Adam(learning_rate=lr)
model.compile(
loss=negative_loglikelihood, #negative_loglikelihood,
optimizer=optimizer,
metrics=[keras.metrics.RootMeanSquaredError()],
)
return model
I think I have the same network than in tfp example, I just added few hidden layers with differents units. Also I added 0.01 in front of the Softplus in the posterior as suggested here, which allows the network to come up to good performances.
Not able to get reasonable results from DenseVariational
The performances of the model are very good (less than 1% of error) but I have some questions:
As Bayesian neural networks "promise" to mesure the uncertainty of the predictions, I was expecting bigger errors on high variance predictions. I ploted the absolute error versus variance and the results are not good enough on my mind. Of course, the model is better at low variance but I can have really bad predicitions at low variance, and therefore cannot really use standard deviation to filter bad predictions. Why is my Bayesian neural netowrk struggling to give me the uncertainty ?
The previous network was train 2000 epochs and we can notice a strange phenome with a vertical bar on lowest stdv. If I increase the number of epoch up to 25000, my results get better either on training and validation set.
But the phenomene of vertical bar that we may notice on the figure 1 is much more obvious. It seems that as much as I increase the number or EPOCH, all output variance converge to 0.68. Is that a case of overfitting ? Why this value of 0.6931571960449219 and why I can't get lower stdv ? As the phenome start appearing at 2000 EPOCH, am i already overfitting at 2000 epochs ?
At this point stdv is totaly useless. So is there a kind of trade off ? With few epochs my model is less performant but gives me some insigh about uncertainty (even if I think they're not sufficient), where with lot of epochs I have better performances but no more uncertainty informations as all outputs have the same stdv.
Sorry for the long post and the language mistakes.
Thank you in advance for you help and any feed back.

I solved the problem of why my uncertainty could not get lower than 0.6931571960449219.
Actually this value is converging to log(2). This is due to my relu activation function on my last Dense Variational layer.
Indeed, the scale of tfd.Normal is a softplus (tf.math.softplus).
And softplus is implement like that : softplus(x) = log(exp(x) + 1). As my x doesn't go in negative values, my minumum incertainty il log(2).
A basic linear activation function solved the problem and my uncertainty has a normal behavior now.

Simple L1 loss in PyTorch

I want to calculate L1 loss in a neural network, I came across this example at https://discuss.pytorch.org/t/simple-l2-regularization/139/2, but there are some errors in this code.
Is this really how to calculate L1 Loss in a NN or is there a simpler way?
l1_crit = nn.L1Loss()
reg_loss = 0
for param in model.parameters():
reg_loss += l1_crit(param)
factor = 0.0005
loss += factor * reg_loss
Is this equivalent in any way to simple doing:
loss = torch.nn.L1Loss()
I assume not, because I am not passing along any network parameters. Just checking if there isn existing function to do this.

If I am understanding well, you want to compute the L1 loss of your model (as you say in the begining). However I think you might got confused with the discussion in the pytorch forum.
From what I understand, in the Pytorch forums, and the code you posted, the author is trying to normalize the network weights with L1 regularization. So it is trying to enforce that weights values fall in a sensible range (not too big, not too small). That is weights normalization using L1 normalization (that is why it is using model.parameters()). Normalization takes a value as input and produces a normalized value as output.
Check this for weights normalization: https://pytorch.org/docs/master/generated/torch.nn.utils.weight_norm.html
On the other hand, L1 Loss it is just a way to determine how 2 values differ from each other, so the "loss" is just measure of this difference. In the case of L1 Loss this error is computed with the Mean Absolute Error loss = |x-y| where x and y are the values to compare. So error compute takes 2 values as input and produces a value as output.
Check this for loss computing: https://pytorch.org/docs/master/generated/torch.nn.L1Loss.html
To answer your question: no, the above snippets are not equivalent, since the first is trying to do weights normalization and the second one, you are trying to compute a loss. This would be the loss computing with some context:
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.L1Loss()
loss_value = loss(target, target_predicted)

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?

In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)

Training yields very low error but very incorrect (almost all ones) output

I am trying to train an autoencoder on some simulated data where an input is basically a vector with Gaussian noise applied. The code is almost exactly the same as in this example: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/autoencoder.py
The only differences are I changed the network parameters and the cost function:
n_hidden_1 = 32 # 1st layer num features
n_hidden_2 = 16 # 2nd layer num features
n_input = 149 # LunaH-Map data input (number of counts per orbit)
cost = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_pred), reduction_indices=[1]))
During training, the error steadily decreases down to 0.00015, but the predicted and true values are very different, e.g.
as shown in this image. In fact, the predicted y vector is almost all ones.
How is it possible to get decreasing error with very wrong predictions? Is it possible that my network is just trying to move the weights closer to log(1) so as to minimize the cross entropy cost? If so, how do I combat this?

Yes, the network simply learns to to predict 1 which reduces the loss. The cross-entropy loss you are using is categorical which is used when y_true is one-hot code (example: [0,0,1,0]) and final layer is softmax (ensures sum of all output is 1). So when y_true[idx] is 0, the loss don't care while when the y_true[idx] is 1 and y_pred[idx] is 0 there is infinite(high) loss but if its 1 then loss is again 0.
Now categorical cross-entropy loss is not suitable for autoencoders. For real valued inputs and hence outputs its mean-squared-error, which is what is used in example that you cited. But there the final activation layer is sigmoid, implicitly saying that each element of x is 0/1. So either you need to convert your data to support the same or have the last layer of decoder linear.
If you do want to use cross-entropy loss you can use binary cross-entropy
For inputs with 0,1 binary cross-entropy: tf.reduce_mean(y_true * tf.log(y_pred) + (1-y_true) * tf.log(1-y_pred)). If you work it out in both misprediction case 0-1, 1-0, the network gets infinite loss. Note again here the final layer should be softmax and elements of x should be between 0 and 1

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:
tf.nn.softmax(logits, name=None)
If logits is just a generic Tensor input, why is it named logits?
Secondly, what is the difference between the following two methods?
tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)
I know what tf.nn.softmax does, but not the other. An example would be really helpful.

The softmax+logits simply means that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, the sum of the inputs may not equal 1, that the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.
tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that sum(input) = 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of output of a softmax is the same as the input:
a = tf.constant(np.array([[.1, .3, .5, .9]]))
print s.run(tf.nn.softmax(a))
[[ 0.16838508 0.205666 0.25120102 0.37474789]]
See this answer for more about why softmax is used extensively in DNNs.
tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss after applying the softmax function, but it does it all together in a more mathematically careful way. It's similar to the result of:
sm = tf.nn.softmax(x)
ce = cross_entropy(sm)
The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2,1] (the first dimension is treated as the batch).
If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.

Short version:
Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x +b) and y_true contains one-hot encoded true labels.
y_hat = ... # Predicted label, e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded
If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.
Additionally, the total cross-entropy loss computed in this manner:
y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))
is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():
total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
Long version:
In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
import tensorflow as tf
import numpy as np
sess = tf.Session()
# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5, 1.5, 0.1],
# [ 2.2, 1.3, 1.7]])
Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863 , 0.61939586, 0.15274114],
# [ 0.49674623, 0.20196195, 0.30129182]])
It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
Pr(Class 1) Pr(Class 2) Pr(Class 3)
,--------------------------------------
Training instance 1 | 0.227863 | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182
So now we have class probabilities for each training instance, where we can take the argmax() of each row to generate a final classification. From above, we may generate that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0., 1., 0.],
# [ 0., 0., 1.]])
Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 , 1.19967598])
What we really want is the total loss over all the training instances. So we can compute:
total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944
Using softmax_cross_entropy_with_logits()
We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.
loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true)
sess.run(loss_per_instance_2)
# array([ 0.4790107 , 1.19967598])
total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
sess.run(total_loss_2)
# 0.83934333897877922
Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().

tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.
tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.
The logits are the unnormalized log probabilities output the model (the values output before the softmax normalization is applied to them).

Mathematical motivation for term
When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.
A common choice is a sigmoid function.1 In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a softmax).2
If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds) a.k.a. logit:
That is why the arguments to softmax is called logits in Tensorflow - because under the assumption that softmax is the final layer in the model, and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit:
Generalised term
In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.
While it has nice properties such as being easily diferentiable, and the aforementioned probabilistic interpretation, it is somewhat arbitrary.
softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.

Above answers have enough description for the asked question.
Adding to that, Tensorflow has optimised the operation of applying the activation function then calculating cost using its own activation followed by cost functions. Hence it is a good practice to use: tf.nn.softmax_cross_entropy() over tf.nn.softmax(); tf.nn.cross_entropy()
You can find prominent difference between them in a resource intensive model.

Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about Logits and the related Functions.
All those functions, when used in Tensorflow 1.x will work fine, but if you migrate your code from 1.x (1.14, 1.15, etc) to 2.x (2.0, 2.1, etc..), using those functions result in error.
Hence, specifying the 2.0 Compatible Calls for all the functions, we discussed above, if we migrate from 1.x to 2.x, for the benefit of the community.
Functions in 1.x:
tf.nn.softmax
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits
Respective Functions when Migrated from 1.x to 2.x:
tf.compat.v2.nn.softmax
tf.compat.v2.nn.softmax_cross_entropy_with_logits
tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits
For more information about migration from 1.x to 2.x, please refer this Migration Guide.

One more thing that I would definitely like to highlight as logit is just a raw output, generally the output of last layer. This can be a negative value as well. If we use it as it's for "cross entropy" evaluation as mentioned below:
-tf.reduce_sum(y_true * tf.log(logits))
then it wont work. As log of -ve is not defined.
So using o softmax activation, will overcome this problem.
This is my understanding, please correct me if Im wrong.

Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes the outputs of a neural network so that they are all between 0 and 1 and sum to 1. Softmax_cross_entropy_with_logits is a loss function that takes in the outputs of a neural network (after they have been squashed by softmax) and the true labels for those outputs, and returns a loss value.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Ignore/Mask certain labels Softmax Cross-entropy - python

Related

Measuring incertainty in Bayesian Neural Network

Simple L1 loss in PyTorch

Incorporate side conditions into Keras neural network

Training yields very low error but very incorrect (almost all ones) output

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

Categories

Resources