How does keras define "accuracy" and "loss"? - python

I can't find how Keras defines "accuracy" and "loss". I know I can specify different metrics (e.g. mse, cross entropy) - but keras prints out a standard "accuracy". How is that defined? Likewise for loss: I know I can specify different types of regularization -- are those in the loss?
Ideally, I'd like to print out the equation used to define it; if not, I'll settle for an answer here.

Have a look at metrics.py, there you can find definition of all available metrics including different types of accuracy. Accuracy is not printed unless you add it to the list of desired metrics when you compile your model.
Regularizers are by definition added to the loss. For example, see add_loss method of the Layerclass.
Update
The type of accuracy is determined based on the objective function, see training.py. The default choice is categorical_accuracy. Other types like binary_accuracy and sparse_categorical_accuracy are selected when the objective function is either binary or sparse.

Related

Pytorch - predict multiple parameters in neural network

I have two parameters which I want a neural network to predict. What is the best or most conventional method to implement the loss function? Currently I just define the loss, torch.nn.L1Loss(), which automatically computes the mean for both parameters such that it becomes a scalar.
Another plausible method would be to create two loss functions, one for each parameter, and successively backpropagate.
I don't really see whether both methods compute the same thing and whether one method is better (or plain wrong).
The probelm could be seen as a Multi-task Probelm. For example, two parameters represents A-Task and B-Task respectively.
In Multi-task, two loss function is often used.
The usual form is as follows,
$$total_loss = \alpha * A_losss(\hat{y_1},y_1) + \bata * A_losss(\hat{y_2},y_2)$$
The $\alpha$ and $\beta$ is the weight of the loss function.Usually they are both 1 or 0.5.

What is keras loss class vs function?

In Keras loss page which is here there are 2 main distinction I saw is loss classes vs loss functions? Can anyone explain why for same losses these 2 APIs given? Is it just for class initialization or any other purposes? Also if anyone can explain that in which cases we should use which one that would be great.
Thanks in advance.
A deep learning model can be built and trained in multiple ways.
The simplest approach to build a model would be to use Keras functional API or sequential API to build the model, use compile method to specify optimizer, loss, metrics, etc, and use the fit method for training the model.
If you choose to build the model this way the compile method accepts loss class.
Note: You can use the loss function as well
The actual logic that computes the loss is present in the special call method of a class which is used internally in the fit method.
However, there are cases (mostly in research) where the training loop has to be written in a certain way from scratch, in that case, you can use loss function to compute losses.
Note: You can use the loss class as well
The loss class gives you some extra functionality like specifying logit value, reduction technique, etc. So if your code requires the use of those functionality use the loss class to compute the losses.
If you do not require any such functionality you can simply use the loss function.
Note: Under the hood, both functions call the same TensorFlow graph.

Smartest way to add KL Divergence into (Variational) Auto Encoder

I have an Auto Encoder model with multiple outputs and weightening which a want to enrich into a Variational Auto Encoder.
I followed this: https://keras.io/examples/generative/vae/ official keras tutorial.
But if a manually adapt the train_step function I lose the majority of my original implementation details:
I got two weighted optimization goals: re-construction (decoder) and classification (softmax)
accuracy metrics for the classification
the original fit method also takes care of the validation data and corresponding metrics
Adding the suggested sampling layer according to the keras link is no problem, but to correctly implement the Kullback-Leibler-Loss as it depends on the additional parameters z_mu and z_log_var which is not supported by standard Keras losses.
I search for some workarounds to solve this issue but none of them was succesfull:
re-writing the train_step: its hard to fully re-implement all details (
weightening, multiple losses with different inputs -> decoder: data, classifier: labels etc)
adding a psyeudo layer to the ecoder that calculates the loss: https://tiao.io/post/tutorial-on-variational-autoencoders-with-a-concise-keras-implementation/ like here. But here is the problem that the add loss function does not specify to which key and how KL-Loss is added to the model's total loss
Adding everything as global/top-level element to make the z_mu, z_log_var accessible for the loss calculation like here: https://www.machinecurve.com/index.php/2019/12/30/how-to-create-a-variational-autoencoder-with-keras/. This is the approach I like the least as my current architecture is parametrized to be able to e.g. perform hyperopt tuning
I was not able to find a pleasing solution to this problem, as VAE's are more and more popular I am surprised by the phenomenon that there is no extended tutorial about this especially when dealing with multiple in- and outputs. Or I am just unable to find the right answers through my query.
Any opinions welcome!
After a couple of re-designs I and bug-ticket tracing I found this recent example:
here
The VAE examples can be found at the very bottom of the post.
Solution: write your own train_step: cleanest but also hardest solution depending how complex your loss calculation is.
Solution: use a functional approach the access the necessary variables and add the loss with .add_loss: not very clean but straight to implement (you will lose an additional loss tracker for the KL-loss)
To achieve my weighting I weighted the KL loss before I added it via .add_loss according to the weight of my decoder loss.
Note: The first solution I tested was to define a custom loss function for the mse+kl loss and added it into my functional designed model - this works if one turns of the tf eager eval off. But be careful this really slows down your network and you will lose the ability to monitor your training via tensorboard if you don't have admin rights for your nvidia gpu (profile_batch=0 does not turn off profiling if eager mode is switched off, therefore you ran into INSUFFICENT_PRIVILEDGES Errors with the CUPTI driver)

How to choose cross-entropy loss in TensorFlow?

Classification problems, such as logistic regression or multinomial
logistic regression, optimize a cross-entropy loss.
Normally, the cross-entropy layer follows the softmax layer,
which produces probability distribution.
In tensorflow, there are at least a dozen of different cross-entropy loss functions:
tf.losses.softmax_cross_entropy
tf.losses.sparse_softmax_cross_entropy
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.softmax_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sigmoid_cross_entropy_with_logits
...
Which one works only for binary classification and which are suitable for multi-class problems? When should you use sigmoid instead of softmax? How are sparse functions different from others and why is it only softmax?
Related (more math-oriented) discussion: What are the differences between all these cross-entropy losses in Keras and TensorFlow?.
Preliminary facts
In functional sense, the sigmoid is a partial case of the softmax function, when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) to probabilities.
In simple binary classification, there's no big difference between the two,
however in case of multinomial classification, sigmoid allows to deal
with non-exclusive labels (a.k.a. multi-labels), while softmax deals
with exclusive classes (see below).
A logit (also called a score) is a raw unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is an output of a dense (fully-connected) layer.
Tensorflow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).
Sigmoid functions family
tf.nn.sigmoid_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy (DEPRECATED)
As stated earlier, sigmoid loss function is for binary classification.
But tensorflow functions are more general and allow to do
multi-label classification, when the classes are independent.
In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N
binary classifications at once.
The labels must be one-hot encoded or can contain soft class probabilities.
tf.losses.sigmoid_cross_entropy in addition allows to set the in-batch weights,
i.e. make some examples more important than others.
tf.nn.weighted_cross_entropy_with_logits allows to set class weights
(remember, the classification is binary), i.e. make positive errors larger than
negative errors. This is useful when the training data is unbalanced.
Softmax functions family
tf.nn.softmax_cross_entropy_with_logits (DEPRECATED IN 1.5)
tf.nn.softmax_cross_entropy_with_logits_v2
tf.losses.softmax_cross_entropy
tf.contrib.losses.softmax_cross_entropy (DEPRECATED)
These loss functions should be used for multinomial mutually exclusive classification,
i.e. pick one out of N classes. Also applicable when N = 2.
The labels must be one-hot encoded or can contain soft class probabilities:
a particular example can belong to class A with 50% probability and class B
with 50% probability. Note that strictly speaking it doesn't mean that
it belongs to both classes, but one can interpret the probabilities this way.
Just like in sigmoid family, tf.losses.softmax_cross_entropy allows
to set the in-batch weights, i.e. make some examples more important than others.
As far as I know, as of tensorflow 1.3, there's no built-in way to set class weights.
[UPD] In tensorflow 1.5, v2 version was introduced and the original softmax_cross_entropy_with_logits loss got deprecated. The only difference between them is that in a newer version, backpropagation happens into both logits and labels (here's a discussion why this may be useful).
Sparse functions family
tf.nn.sparse_softmax_cross_entropy_with_logits
tf.losses.sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy (DEPRECATED)
Like ordinary softmax above, these loss functions should be used for
multinomial mutually exclusive classification, i.e. pick one out of N classes.
The difference is in labels encoding: the classes are specified as integers (class index),
not one-hot vectors. Obviously, this doesn't allow soft classes, but it
can save some memory when there are thousands or millions of classes.
However, note that logits argument must still contain logits per each class,
thus it consumes at least [batch_size, classes] memory.
Like above, tf.losses version has a weights argument which allows
to set the in-batch weights.
Sampled softmax functions family
tf.nn.sampled_softmax_loss
tf.contrib.nn.rank_sampled_softmax_loss
tf.nn.nce_loss
These functions provide another alternative for dealing with huge number of classes.
Instead of computing and comparing an exact probability distribution, they compute
a loss estimate from a random sample.
The arguments weights and biases specify a separate fully-connected layer that
is used to compute the logits for a chosen sample.
Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].
Sampled functions are only suitable for training. In test time, it's recommended to
use a standard softmax loss (either sparse or one-hot) to get an actual distribution.
Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function to the softmax family, because NCE guarantees approximation to softmax in the limit.
However, for version 1.5, softmax_cross_entropy_with_logits_v2 must be used instead, while using its argument with the argument key=..., for example
softmax_cross_entropy_with_logits_v2(_sentinel=None, labels=y,
logits=my_prediction, dim=-1, name=None)
While it is great that the accepted answer contains lot more info than what is asked, I felt that sharing a few generic thumb rules will make the answer more compact and intuitive:
There is just one real loss function. This is cross-entropy (CE). For a special case of a binary classification, this loss is called binary CE (note that the formula does not change) and for non-binary or multi-class situations the same is called categorical CE (CCE). Sparse functions are a special case of categorical CE where the expected values are not one-hot encoded but is an integer
We have the softmax formula which is an activation for multi-class scenario. For binary scenario, same formula is given a special name - sigmoid activation
Because there are sometimes numerical instabilities (for extreme values) when dealing with logarithmic functions, TF recommends combining the activation layer and the loss layer into one single function. This combined function is numerically more stable. TF provides these combined functions and they are suffixed with _with_logits
With this, let us now approach some situations. Say there is a simple binary classification problem - Is a cat present or not in the image? What is the choice of activation and loss function? It will be a sigmoid activation and a (binary)CE. So one could use sigmoid_cross_entropy or more preferably sigmoid_cross_entropy_with_logits. The latter combines the activation and the loss function and is supposed to be numerically stable.
How about a multi-class classification. Say we want to know if a cat or a dog or a donkey is present in the image. What is the choice of activation and loss function? It will be a softmax activation and a (categorical)CE. So one could use softmax_cross_entropy or more preferably softmax_cross_entropy_with_logits. We assume that the expected value is one-hot encoded (100 or 010 or 001). If (for some weird reason), this is not the case and the expected value is an integer (either 1 or 2 or 3) you could use the 'sparse' counterparts of the above functions.
There could be a third case. We could have a multi-label classification. So there could be a dog and a cat in the same image. How do we handle this? The trick here is to treat this situation as a multiple binary classification problems - basically cat or no cat / dog or no dog and donkey or no donkey. Find out the loss for each of the 3 (binary classifications) and then add them up. So essentially this boils down to using the sigmoid_cross_entropy_with_logits loss.
This answers the 3 specific questions you have asked. The functions shared above are all that are needed. You can ignore the tf.contrib family which is deprecated and should not be used.

What is the meaning of 'self.diff' in 'forward' of a custom python loss layer for Caffe training?

I try to use a custom python loss layer. When I checked several examples online, such as:
Euclidean loss layer, Dice loss layer,
I notice a variable 'self.diff' is always assigned in 'forward'. Especially for the Dice loss layer,
self.diff[...] = bottom[1].data
I wonder if there is any reason that this variable has to be introduced in forward or I can just use bottom[1].data to access ground truth label?
In addition, what is the point of top[0].reshape(1) in reshape, since by definition in forward, the loss output is a scalar itself.
You need to set the diff attribute of the layer for overall consistency and data communication protocol; it's available other places in the class, and anywhere the loss layer object appears. bottom is a local parameter, and is not available elsewhere in the same form.
In general, the code is expandable for a variety of applications and more complex computations; the reshaping is part of this, ensuring that the returned value is scalar, even if someone expands the inputs to work with vectors or matrices.

Categories

Resources