When computing the loss between y_true and y_pred, the Keras loss functions reduce the dimensionality by one. For example, when training a network on pairs of 64x64 greyscale images with batch size = 8, the shape of y_true and y_pred would be (8, 64, 64). The Keras loss functions will produce a loss tensor with shape (8, 64), averaging over the last dimension.
I do not get why that would be necessary; all it does is average the loss over the rows of the image. Doesn't the network need the loss to be calculated individually for every output value (and therefore conserve the shape)? As far as I understand it, backpropagation looks at the individual loss of each output value compared to the target, and then updates previous weights accordingly. How can it do that knowing only the averaged loss of each row, not every value individually? Here is a code snippet that shows the behaviour I described:
import tensorflow as tf
from keras import backend as K
from keras.losses import mean_absolute_error

y_true = K.random_uniform([8, 64, 64])
y_pred = K.random_uniform([8, 64, 64])
c = mean_absolute_error(y_true, y_pred)
print(K.eval(tf.shape(c)))  # (8, 64)
I wondered the same thing. I believe Keras assumes your data to have the dimensions [batch, W, H, n_classes], in which case averaging over axis=-1 means averaging the loss over all the different classes. In your case, however, you do not have that dimension, because you presumably do a binary classification on a grayscale image. So instead it ends up averaging the loss over the rows/columns. Interestingly enough, the model can still train and even improve in performance like this, which makes me believe that people in similar situations often just train their model without ever noticing.
You can avoid this by adding a dummy axis to your data.
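For example (a minimal sketch reusing the imports from the snippet in the question), a trailing dummy axis of size 1 is consumed by the reduction, so the per-pixel losses keep the spatial shape:
y_true = K.random_uniform([8, 64, 64, 1])
y_pred = K.random_uniform([8, 64, 64, 1])
c = mean_absolute_error(y_true, y_pred)
print(K.eval(tf.shape(c)))  # (8, 64, 64) - one loss value per pixel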
This is how I got there:
From: https://keras.io/api/losses/
"(Note ondN-1: all loss functions reduce by 1 dimension, usually axis=-1.) “
Furthermore: "loss class instances feature a reduction constructor argument, which defaults to "sum_over_batch_size" (i.e. average). Allowable values are "sum_over_batch_size", "sum", and "none":
• "sum_over_batch_size" means the loss instance will return the average of the per-sample losses in the batch.
• "sum" means the loss instance will return the sum of the per-sample losses in the batch.
• "none" means the loss instance will return the full array of per-sample losses. "
From https://www.tensorflow.org/api_docs/python/tf/keras/losses/Reduction
"Caution: Verify the shape of the outputs when using Reduction.NONE. The builtin loss functions wrapped by the loss classes reduce one dimension (axis=-1, or axis if specified by loss function). Reduction.NONE just means that no additional reduction is applied by the class wrapper. For categorical losses with an example input shape of [batch, W, H, n_classes] the n_classes dimension is reduced. For pointwise losses you must include a dummy axis so that [batch, W, H, 1] is reduced to [batch, W, H]. Without the dummy axis [batch, W, H] will be incorrectly reduced to [batch, W]."
I am confused with the input shapes for tensors in nn.CrossEntropyLoss.
I am trying to implement a simple autoencoder for text sequences. The core of my problem can be illustrated by the following code
import torch
import torch.nn as nn

predictions = torch.rand(2, 3, 4)
target = torch.rand(2, 3)
print(predictions.shape)  # torch.Size([2, 3, 4])
print(target.shape)       # torch.Size([2, 3])
nn.CrossEntropyLoss(predictions.transpose(1, 2), target)
In my case predictions has the shape (time_step, batch_size, vocabulary_size), while target has the shape (time_step, batch_size). Next, I transpose the predictions, as per the documentation, which says that the second dimension of predictions should be the number of classes (vocabulary_size in my case). The code returns the error RuntimeError: bool value of Tensor with more than one value is ambiguous. Could someone please enlighten me on how to use the damn thing? Thank you in advance!
You are not calling the loss function, but you are building it. The signature of the nn.CrossEntropyLoss constructor is:
nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
You are setting the predictions as the weight and the target as size_average, where weight is an optional rescaling of the classes and size_average is deprecated but expects a boolean. The target is a tensor of size [2, 3], which cannot be converted to a boolean.
You need to create the loss function first; since you don't use any of the optional parameters of the constructor, you don't need to specify any of them.
# Create the loss function
cross_entropy = nn.CrossEntropyLoss()
# Call it to calculate the loss for your data
loss = cross_entropy(predictions.transpose(1, 2), target)
Alternatively, you can directly use the functional version nn.functional.cross_entropy:
import torch.nn.functional as F
loss = F.cross_entropy(predictions.transpose(1, 2), target)
The advantage of the class version, compared to the functional version, is that you only need to specify the extra parameters once (such as the weight) instead of having to supply them manually each time.
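As a hedged sketch of that advantage (the class_weights values here are made up), the weights are given once to the constructor and the loss object is then reused for every batch:
import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 2.0, 0.5, 1.0])  # hypothetical per-class weights
cross_entropy = nn.CrossEntropyLoss(weight=class_weights)
predictions = torch.rand(2, 4, 3)   # [batch_size, vocabulary_size, time_step]
target = torch.randint(4, (2, 3))   # [batch_size, time_step]
loss = cross_entropy(predictions, target)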
Regarding the dimensions of the tensors, the batch size must be the first dimension, because the losses are computed per element in the batch, giving a tensor of losses with size [batch_size] (or, when the target has extra dimensions, one loss per element). If you used reduction="none", you would get back these per-element losses, but by default (reduction="mean") the mean of these losses is returned. That result would be different if the mean were taken across time steps rather than across batch elements.
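A minimal sketch of that reduction behaviour (shapes chosen arbitrarily):
import torch
import torch.nn.functional as F

predictions = torch.rand(2, 4, 3)   # [batch_size, vocabulary_size, time_step]
target = torch.randint(4, (2, 3))   # [batch_size, time_step]
print(F.cross_entropy(predictions, target, reduction="none").shape)
# torch.Size([2, 3]) - one loss per element
print(F.cross_entropy(predictions, target))  # scalar: the mean of those losses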
Lastly, the targets need to be the class indices, which means they need to have type torch.long, not torch.float. In this random example, you can create the classes with torch.randint.
import torch
import torch.nn.functional as F

predictions = torch.rand(2, 3, 4)
target = torch.randint(4, (2, 3))

# Reorder the dimensions
# From: [time_step, batch_size, vocabulary_size]
# To:   [batch_size, vocabulary_size, time_step]
predictions = predictions.permute(1, 2, 0)

# From: [time_step, batch_size]
# To:   [batch_size, time_step]
target = target.transpose(0, 1)

F.cross_entropy(predictions, target)
I am trying to train an autoencoder on some simulated data where an input is basically a vector with Gaussian noise applied. The code is almost exactly the same as in this example: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/autoencoder.py
The only differences are that I changed the network parameters and the cost function:
n_hidden_1 = 32 # 1st layer num features
n_hidden_2 = 16 # 2nd layer num features
n_input = 149 # LunaH-Map data input (number of counts per orbit)
cost = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_pred), reduction_indices=[1]))
During training, the error steadily decreases down to 0.00015, but the predicted and true values are very different, e.g. as shown in this image. In fact, the predicted y vector is almost all ones.
How is it possible to get decreasing error with very wrong predictions? Is it possible that my network is just trying to move the weights closer to log(1) so as to minimize the cross entropy cost? If so, how do I combat this?
Yes, the network simply learns to predict 1, which reduces the loss. The cross-entropy loss you are using is categorical, which is used when y_true is one-hot encoded (example: [0,0,1,0]) and the final layer is a softmax (which ensures the sum of all outputs is 1). So when y_true[idx] is 0, the loss doesn't care; when y_true[idx] is 1 and y_pred[idx] is 0 there is infinite (very high) loss, but if y_pred[idx] is 1 the loss is again 0.
Now, categorical cross-entropy loss is not suitable for autoencoders. For real-valued inputs, and hence outputs, the appropriate loss is mean squared error, which is what is used in the example that you cited. But there the final activation layer is a sigmoid, implicitly saying that each element of x lies between 0 and 1. So either you need to convert your data to match that range, or make the last layer of the decoder linear.
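As a hedged sketch (reusing the y_true and y_pred names from your code), the mean-squared-error cost would be:
cost = tf.reduce_mean(tf.square(y_true - y_pred))
together with a linear decoder output (or data rescaled to match a sigmoid).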
If you do want to use a cross-entropy loss, you can use binary cross-entropy.
For inputs in [0, 1], binary cross-entropy is: -tf.reduce_mean(y_true * tf.log(y_pred) + (1 - y_true) * tf.log(1 - y_pred)). If you work it out, in both misprediction cases (true 0 predicted as 1, true 1 predicted as 0) the network gets infinite loss. Note that here the final layer should be a sigmoid, so that each element of y_pred lies between 0 and 1.
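In practice, rather than writing the formula by hand, you might prefer TensorFlow's built-in op, which takes the raw pre-sigmoid values and is numerically stable (avoids log(0)); decoder_logits here is a hypothetical name for your decoder's pre-activation output:
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=decoder_logits))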
I am starting to use TensorFlow (coming from Caffe), and I am using the loss sparse_softmax_cross_entropy_with_logits. The function accepts labels like 0, 1, ..., C-1 instead of one-hot encodings. Now, I want to use a weighting that depends on the class label. I know that this could be done with a matrix multiplication if I used softmax_cross_entropy_with_logits (one-hot encoding); is there any way to do the same with sparse_softmax_cross_entropy_with_logits?
import tensorflow as tf
import numpy as np
np.random.seed(123)
sess = tf.InteractiveSession()
# let's say we have the logits and labels of a batch of size 6 with 5 classes
logits = tf.constant(np.random.randint(0, 10, 30).reshape(6, 5), dtype=tf.float32)
labels = tf.constant(np.random.randint(0, 5, 6), dtype=tf.int32)
# specify some class weightings
class_weights = tf.constant([0.3, 0.1, 0.2, 0.3, 0.1])
# specify the weights for each sample in the batch (without having to compute the onehot label matrix)
weights = tf.gather(class_weights, labels)
# compute the loss
tf.losses.sparse_softmax_cross_entropy(labels, logits, weights).eval()
Specifically for binary classification, there is weighted_cross_entropy_with_logits, that computes weighted softmax cross entropy.
sparse_softmax_cross_entropy_with_logits is tailored for a highly efficient non-weighted operation (see SparseSoftmaxXentWithLogitsOp, which uses SparseXentEigenImpl under the hood), so it's not "pluggable".
In the multi-class case, your options are either to switch to one-hot encoding or to use the tf.losses.sparse_softmax_cross_entropy loss function in a hacky way, as already suggested, where you pass the weights depending on the labels in the current batch.
The class weights are multiplied by the logits, so that still works for sparse_softmax_cross_entropy_with_logits. Refer to this solution for "Loss function for class imbalanced binary classifier in Tensor flow."
As a side note, you can pass weights directly into sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy(logits, labels, weight=1.0, scope=None)
This method is for cross-entropy loss using
tf.nn.sparse_softmax_cross_entropy_with_logits.
Weight acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weight is a tensor of size [batch_size], then the loss weights apply to each corresponding sample.
In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:
tf.nn.softmax(logits, name=None)
If logits is just a generic Tensor input, why is it named logits?
Secondly, what is the difference between the following two methods?
tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)
I know what tf.nn.softmax does, but not the other. An example would be really helpful.
The "with logits" part simply means that the function operates on the unscaled output of earlier layers, and that the relative scale used to understand the units is linear. It means, in particular, that the sum of the inputs may not equal 1, and that the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.
tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that the outputs sum to 1: it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them into probabilities between 0 and 1. The shape of the output of a softmax is the same as the input:
import numpy as np
import tensorflow as tf

sess = tf.Session()
a = tf.constant(np.array([[.1, .3, .5, .9]]))
print(sess.run(tf.nn.softmax(a)))
# [[ 0.16838508  0.205666    0.25120102  0.37474789]]
See this answer for more about why softmax is used extensively in DNNs.
tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss, but it does both together in a more mathematically careful way. It's similar to the result of:
sm = tf.nn.softmax(x)
ce = cross_entropy(sm)
The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2] (the first dimension is treated as the batch, and you get one loss value per batch element).
If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
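A small sketch of one such corner case (TF 1.x session style, matching the snippets above): with a large logit, the naive softmax-then-log underflows to log(0), while the fused op stays finite:
import tensorflow as tf

sess = tf.Session()
logits = tf.constant([[1000.0, 0.0]])
labels = tf.constant([[0.0, 1.0]])
naive = -tf.reduce_sum(labels * tf.log(tf.nn.softmax(logits)), axis=1)
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(sess.run(naive))  # [ inf ]  - softmax underflows to 0 and log(0) = -inf
print(sess.run(fused))  # [ 1000.] - computed stably from the logits directly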
Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.
Short version:
Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x +b) and y_true contains one-hot encoded true labels.
y_hat = ... # Predicted label, e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded
If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.
Additionally, the total cross-entropy loss computed in this manner:
y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))
is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():
total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
Long version:
In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
import tensorflow as tf
import numpy as np
sess = tf.Session()
# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5, 1.5, 0.1],
# [ 2.2, 1.3, 1.7]])
Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863 , 0.61939586, 0.15274114],
# [ 0.49674623, 0.20196195, 0.30129182]])
It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
Pr(Class 1) Pr(Class 2) Pr(Class 3)
,--------------------------------------
Training instance 1 | 0.227863 | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182
So now we have class probabilities for each training instance, and we can take the argmax() of each row to generate a final classification. From the output above, we can conclude that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
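Continuing the session above, that classification is a one-liner (added here just for illustration):
sess.run(tf.argmax(y_hat_softmax, 1))
# array([1, 0]) - i.e. "Class 2" for instance 1 and "Class 1" for instance 2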
Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0., 1., 0.],
# [ 0., 0., 1.]])
Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 , 1.19967598])
What we really want is the total loss over all the training instances. So we can compute:
total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944
Using softmax_cross_entropy_with_logits()
We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.
loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true)
sess.run(loss_per_instance_2)
# array([ 0.4790107 , 1.19967598])
total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
sess.run(total_loss_2)
# 0.83934333897877922
Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().
tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.
tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.
The logits are the unnormalized log probabilities output by the model (the values output before the softmax normalization is applied to them).
Mathematical motivation for term
When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.
A common choice is a sigmoid function. In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a. softmax).
If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds), a.k.a. the logit:
x = logit(p) = log(p / (1 - p))
That is why the argument to softmax is called logits in Tensorflow: under the assumption that softmax is the final layer in the model, and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit:
p = sigmoid(x) = 1 / (1 + exp(-x)) in the logistic case, or p_i = softmax(x)_i = exp(x_i) / sum_j exp(x_j) in the multi-class case.
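A quick numeric check of that inverse relationship (plain numpy, illustrative only):
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p = 0.8
x = logit(p)       # 1.386... = log(4), the log-odds
print(sigmoid(x))  # 0.8 - the sigmoid inverts the logit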
Generalised term
In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.
While it has nice properties such as being easily differentiable, and the aforementioned probabilistic interpretation, it is somewhat arbitrary.
softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.
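A small numpy sketch of that claim: scaling the inputs up pushes softmax toward a one-hot argmax indicator:
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))       # [0.09  0.245 0.665] - a "soft" distribution
print(softmax(10 * x))  # [~0.   ~0.   ~1.  ] - approaches the argmax indicator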
The answers above have enough description for the question asked.
Adding to that, Tensorflow has optimised the operation of applying the activation function and then calculating the cost with its own fused activation-plus-cost functions. Hence it is good practice to use tf.nn.softmax_cross_entropy_with_logits() rather than a separate tf.nn.softmax() followed by your own cross-entropy computation.
You can find a noticeable difference between them in a resource-intensive model.
Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about logits and the related functions.
All those functions, when used in Tensorflow 1.x, will work fine, but if you migrate your code from 1.x (1.14, 1.15, etc.) to 2.x (2.0, 2.1, etc.), using those functions results in errors.
Hence, for the benefit of the community, here are the 2.0-compatible calls for all the functions discussed above, to use when migrating from 1.x to 2.x.
Functions in 1.x:
tf.nn.softmax
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits
Respective Functions when Migrated from 1.x to 2.x:
tf.compat.v2.nn.softmax
tf.compat.v2.nn.softmax_cross_entropy_with_logits
tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits
For more information about migration from 1.x to 2.x, please refer to this Migration Guide.
One more thing I would like to highlight: a logit is just a raw output, generally the output of the last layer. It can be a negative value as well. If we use it as-is for a "cross entropy" evaluation as below:
-tf.reduce_sum(y_true * tf.log(logits))
then it won't work, as the log of a negative value is not defined.
Using a softmax activation first overcomes this problem.
This is my understanding; please correct me if I'm wrong.
Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes the outputs of a neural network so that they are all between 0 and 1 and sum to 1. Softmax_cross_entropy_with_logits is a loss function that takes in the raw outputs of a neural network (before they have been squashed by softmax, i.e. the logits) and the true labels for those outputs, and returns a loss value.
I'm new to Theano and trying to use the convolutional network and denoising autoencoder examples to make a denoising convolutional network. I am currently struggling with how to make W', the reverse weights. In this paper they use tied weights for W' that are flipped in both dimensions.
I'm currently working on a 1d signal, so my image shape is (batch_size, 1, 1, 1000) and filter/W size is (num_kernels, 1, 1, 10) for example. The output of the convolution is then (batch_size, num_kernels, 1, 991).
Since I want W' to just be W flipped in 2 dimensions (or 1 dimension in my case), I'm tempted to do this:
w_value = numpy_rng.uniform(low=-W_bound, high=W_bound, size=filter_shape)
self.W = theano.shared(np.asarray(w_value, dtype=theano.config.floatX), borrow=True)
self.W_prime = T.repeat(self.W[:, :, :, ::-1], num_kernels, axis=1)
where I reverse flip it in the relevant dimension and repeat those weights so that they are the same dimension as the feature maps from the hidden layer.
With this setup, do I only have to get the gradients for W to update or should W_prime also be a part of the grad computation?
When I do it like this, the MSE drops a lot after the first minibatch and then stops changing. Using cross entropy gives NaN from the first iteration. I don't know if that is related to this issue or if it's one of many other potential bugs I have in my code.
I can't comment on the validity of your W_prime approach, but I can say that you only need to compute the gradient of the cost with respect to each of the original shared variables. Your W_prime is a symbolic function of W, not a shared variable itself, so you don't need to compute gradients with respect to W_prime.
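A minimal, hypothetical Theano sketch of that point (shapes and cost invented for illustration): T.grad differentiates through the symbolic flip automatically, so only the shared variable W appears in the gradient computation:
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
W = theano.shared(rng.uniform(-0.1, 0.1, (4, 4)).astype(theano.config.floatX))
W_prime = W[:, ::-1]  # a symbolic function of W, not a shared variable

x = T.vector("x")
cost = T.sum((T.dot(W_prime, x) - 1.0) ** 2)
gW = T.grad(cost, W)  # gradient w.r.t. W only; none needed for W_prime
train = theano.function([x], cost, updates=[(W, W - 0.01 * gW)])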
Whenever you get NaNs, the first thing to try is to reduce the size of the learning rate.