Classification problems, such as logistic regression or multinomial
logistic regression, optimize a cross-entropy loss.
Normally, the cross-entropy layer follows the softmax layer,
which produces probability distribution.
In tensorflow, there are at least a dozen of different cross-entropy loss functions:
tf.losses.softmax_cross_entropy
tf.losses.sparse_softmax_cross_entropy
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.softmax_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sigmoid_cross_entropy_with_logits
...
Which one works only for binary classification and which are suitable for multi-class problems? When should you use sigmoid instead of softmax? How are sparse functions different from others and why is it only softmax?
Related (more math-oriented) discussion: What are the differences between all these cross-entropy losses in Keras and TensorFlow?.
Preliminary facts
In functional sense, the sigmoid is a partial case of the softmax function, when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) to probabilities.
In simple binary classification, there's no big difference between the two,
however in case of multinomial classification, sigmoid allows to deal
with non-exclusive labels (a.k.a. multi-labels), while softmax deals
with exclusive classes (see below).
A logit (also called a score) is a raw unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is an output of a dense (fully-connected) layer.
Tensorflow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).
Sigmoid functions family
tf.nn.sigmoid_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy (DEPRECATED)
As stated earlier, sigmoid loss function is for binary classification.
But tensorflow functions are more general and allow to do
multi-label classification, when the classes are independent.
In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N
binary classifications at once.
The labels must be one-hot encoded or can contain soft class probabilities.
tf.losses.sigmoid_cross_entropy in addition allows to set the in-batch weights,
i.e. make some examples more important than others.
tf.nn.weighted_cross_entropy_with_logits allows to set class weights
(remember, the classification is binary), i.e. make positive errors larger than
negative errors. This is useful when the training data is unbalanced.
Softmax functions family
tf.nn.softmax_cross_entropy_with_logits (DEPRECATED IN 1.5)
tf.nn.softmax_cross_entropy_with_logits_v2
tf.losses.softmax_cross_entropy
tf.contrib.losses.softmax_cross_entropy (DEPRECATED)
These loss functions should be used for multinomial mutually exclusive classification,
i.e. pick one out of N classes. Also applicable when N = 2.
The labels must be one-hot encoded or can contain soft class probabilities:
a particular example can belong to class A with 50% probability and class B
with 50% probability. Note that strictly speaking it doesn't mean that
it belongs to both classes, but one can interpret the probabilities this way.
Just like in sigmoid family, tf.losses.softmax_cross_entropy allows
to set the in-batch weights, i.e. make some examples more important than others.
As far as I know, as of tensorflow 1.3, there's no built-in way to set class weights.
[UPD] In tensorflow 1.5, v2 version was introduced and the original softmax_cross_entropy_with_logits loss got deprecated. The only difference between them is that in a newer version, backpropagation happens into both logits and labels (here's a discussion why this may be useful).
Sparse functions family
tf.nn.sparse_softmax_cross_entropy_with_logits
tf.losses.sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy (DEPRECATED)
Like ordinary softmax above, these loss functions should be used for
multinomial mutually exclusive classification, i.e. pick one out of N classes.
The difference is in labels encoding: the classes are specified as integers (class index),
not one-hot vectors. Obviously, this doesn't allow soft classes, but it
can save some memory when there are thousands or millions of classes.
However, note that logits argument must still contain logits per each class,
thus it consumes at least [batch_size, classes] memory.
Like above, tf.losses version has a weights argument which allows
to set the in-batch weights.
Sampled softmax functions family
tf.nn.sampled_softmax_loss
tf.contrib.nn.rank_sampled_softmax_loss
tf.nn.nce_loss
These functions provide another alternative for dealing with huge number of classes.
Instead of computing and comparing an exact probability distribution, they compute
a loss estimate from a random sample.
The arguments weights and biases specify a separate fully-connected layer that
is used to compute the logits for a chosen sample.
Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].
Sampled functions are only suitable for training. In test time, it's recommended to
use a standard softmax loss (either sparse or one-hot) to get an actual distribution.
Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function to the softmax family, because NCE guarantees approximation to softmax in the limit.
However, for version 1.5, softmax_cross_entropy_with_logits_v2 must be used instead, while using its argument with the argument key=..., for example
softmax_cross_entropy_with_logits_v2(_sentinel=None, labels=y,
logits=my_prediction, dim=-1, name=None)
While it is great that the accepted answer contains lot more info than what is asked, I felt that sharing a few generic thumb rules will make the answer more compact and intuitive:
There is just one real loss function. This is cross-entropy (CE). For a special case of a binary classification, this loss is called binary CE (note that the formula does not change) and for non-binary or multi-class situations the same is called categorical CE (CCE). Sparse functions are a special case of categorical CE where the expected values are not one-hot encoded but is an integer
We have the softmax formula which is an activation for multi-class scenario. For binary scenario, same formula is given a special name - sigmoid activation
Because there are sometimes numerical instabilities (for extreme values) when dealing with logarithmic functions, TF recommends combining the activation layer and the loss layer into one single function. This combined function is numerically more stable. TF provides these combined functions and they are suffixed with _with_logits
With this, let us now approach some situations. Say there is a simple binary classification problem - Is a cat present or not in the image? What is the choice of activation and loss function? It will be a sigmoid activation and a (binary)CE. So one could use sigmoid_cross_entropy or more preferably sigmoid_cross_entropy_with_logits. The latter combines the activation and the loss function and is supposed to be numerically stable.
How about a multi-class classification. Say we want to know if a cat or a dog or a donkey is present in the image. What is the choice of activation and loss function? It will be a softmax activation and a (categorical)CE. So one could use softmax_cross_entropy or more preferably softmax_cross_entropy_with_logits. We assume that the expected value is one-hot encoded (100 or 010 or 001). If (for some weird reason), this is not the case and the expected value is an integer (either 1 or 2 or 3) you could use the 'sparse' counterparts of the above functions.
There could be a third case. We could have a multi-label classification. So there could be a dog and a cat in the same image. How do we handle this? The trick here is to treat this situation as a multiple binary classification problems - basically cat or no cat / dog or no dog and donkey or no donkey. Find out the loss for each of the 3 (binary classifications) and then add them up. So essentially this boils down to using the sigmoid_cross_entropy_with_logits loss.
This answers the 3 specific questions you have asked. The functions shared above are all that are needed. You can ignore the tf.contrib family which is deprecated and should not be used.
Related
I have two parameters which I want a neural network to predict. What is the best or most conventional method to implement the loss function? Currently I just define the loss, torch.nn.L1Loss(), which automatically computes the mean for both parameters such that it becomes a scalar.
Another plausible method would be to create two loss functions, one for each parameter, and successively backpropagate.
I don't really see whether both methods compute the same thing and whether one method is better (or plain wrong).
The probelm could be seen as a Multi-task Probelm. For example, two parameters represents A-Task and B-Task respectively.
In Multi-task, two loss function is often used.
The usual form is as follows,
$$total_loss = \alpha * A_losss(\hat{y_1},y_1) + \bata * A_losss(\hat{y_2},y_2)$$
The $\alpha$ and $\beta$ is the weight of the loss function.Usually they are both 1 or 0.5.
I'm working on a text classification problem (e.g. sentiment analysis), where I need to classify a text string into one of five classes.
I just started using the Huggingface Transformer package and BERT with PyTorch. What I need is a classifier with a softmax layer on top so that I can do 5-way classification. Confusingly, there seem to be two relevant options in the Transformer package: BertForSequenceClassification and BertForMultipleChoice.
Which one should I use for my 5-way classification task? What are the appropriate use cases for them?
The documentation for BertForSequenceClassification doesn't mention softmax at all, although it does mention cross-entropy. I am not sure if this class is only for 2-class classification (i.e. logistic regression).
Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.
labels (torch.LongTensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
The documentation for BertForMultipleChoice mentions softmax, but the way the labels are described, it sound like this class is for multi-label classification (that is, a binary classification for multiple labels).
Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.
labels (torch.LongTensor of shape (batch_size,), optional, defaults to None) – Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices] where num_choices is the size of the second dimension of the input tensors.
Thank you for any help.
The answer to this lies in the (admittedly very brief) description of what the tasks are about:
[BertForMultipleChoice] [...], e.g. for RocStories/SWAG tasks.
When looking at the paper for SWAG, it seems that the task is actually learning to choose from varying options. This is in contrast to your "classical" classification task, in which the "choices" (i.e., classes) do not vary across your samples, which is exactly what BertForSequenceClassification is for.
Both variants can in fact be for an arbitrary number of classes (in the case of BertForSequenceClassification), respectively choices (for BertForMultipleChoice), via changing the labels parameter in the config. But, since it seems like you are dealing with a case of "classical classification", I suggest using the BertForSequenceClassification model.
Shortly addressing the missing Softmax in BertForSequenceClassification: Since classification tasks can compute loss across classes indipendent of the sample (unlike multiple choice, where your distribution is changing), this allows you to use Cross-Entropy Loss, which factors in Softmax in the backpropagation step for increased numerical stability.
I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activation, and switching from gradient descent to Adam, and adding L2 regularization... the model predicts same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, there should be no problem with the method itself, but problem with the execution. #DomJack explained how output is restricted for a specific set of parameters, but that only happens for anomaly data. In general, when training parameters would be tuned so that it will predict output correctly.
So first point is about the quality of training data. Make sure you have large enough training data (and it is split randomly if you split train/test from one dataset). Also, maybe trivial, but make sure you didn't mess up input/output value in preprocessing.
Another point is about the size of the network. Make sure you use large enough hidden layer.
I was recently learning about neural networks and came across MNIST data set. i understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases gets adjusted and an optimum weights and biases are found after the training. the thing i did not understand is, on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc., becomes features. in MNIST dataset, i did not find any of that. Am i missing something here. Please help me with this
First of all the Network pipeline consists of 3 main parts:
Input Manipulation:
Parameters that effect the finding of minimum:
Parameters like your descission function in your interpretation
layer (often fully connected layer)
In contrast to your regular machine learning pipeline where you have to extract features manually a CNN uses filters. (Filters like in edge detection or viola and jones).
If a filter runs across the images and is convolved with pixels it Produces an output.
This output is then interpreted by a neuron. If the output is above a threshold it is considered as valid (Step function counts 1 if valid or in case of Sigmoid it has a value on the sigmoid function).
The next steps are the same as before.
This is progressed until the interpretation layer (often softmax). This layer interprets your computation (if the filters are good adapted to your problem you will get a good predicted label) which means you have a low difference between (y_guess - y_true_label).
Now you can see that for the guess of y we have multiplied the input x with many weights w and also used functions on it. This can be seen like a chain rule in analysis.
To get better results the effect of a single weight on the input must be known. Therefore, you use Backpropagation which is a derivative of the Error with respect to all w. The Trick is that you can reuse derivatives which is more or less Backpropagation and it becomes easier since you can use Matrix vector notation.
If you have your gradient, you can use the normal concept of minimization where you walk along the steepest descent. (There are also many other gradient methods like adagrad or adam etc).
The steps will repeat until convergence or until you reach the maximum epochs.
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)
In keras backend we have a flag with_logits in K.binary_crossentropy. What is the difference between normal binary crossentropy and binary crossentropy with logits? Suppose I am using a seq2seq model and my output sequence is of type 100111100011101.
What should I use for an recursive LSTM or RNN to learn from this data provided I am giving a similar sequence in the input along with timesteps?
This depends on whether or not you have a sigmoid layer just before the loss function.
If there is a sigmoid layer, it will squeeze the class scores into probabilities, in this case from_logits should be False. The loss function will transform the probabilities into logits, because that's what tf.nn.sigmoid_cross_entropy_with_logits expects.
If the output is already a logit (i.e. the raw score), pass from_logits=True, no transformation will be made.
Both options are possible and the choice depends on your network architecture. By the way if the term logit seems scary, take a look at this question which discusses it in detail.