How to use Softmax Activation function within a Neural Network

How to use Softmax Activation function within a Neural Network - python

Understanding until now- An activation function is applied on the neuron.What goes inside the function is the sum of each(connected-neuron-value*connected-weights).A single value enters the function,single value is returned from it. The above understanding works fine with tanh and sigmoid . Now I know how softmax works and it sums the values and everything other related to it.What confuses me is that softmax takes an array of numbers, I start questioning what are the sources of these numbers which forms the array ?
The following picture gives more insight into the question

Softmax works on an entire layer of neurons, and must have all their values to compute each of their outputs.
The softmax function looks like softmax_i(v) = exp(v_i)/sum_j(exp(v_j)), where v would be your neuron values (in your image, [0.82, 1.21, 0.74]), and exp is just exp(x) = e^x. Thus, exp(v_i) would be [2.27, 3.35, 2.096]. Divide each of those values by the sum of the entire vector, and you get [0.29, 0.43, 0.27]. These are the activation outputs of your neurons.
This is useful because the values add up to 1 (forgive the rounding errors in the example above that sum to 0.99... you get the idea), and thus can be interpreted as probabilities, e.g., the probability that an image is one particular class (when it can only belong to one class). That's why the computation needs to know the values of the entire vector of neurons, and can't be computed if you only know the value of a single neuron.
Note that, because of this, you don't usually have another layer after the softmax. Usually, the softmax is applied as the activation on your output layer, not a middle layer like you show. That said, it's perfectly valid to build a network the way you show, you'll just have another weight layer going to your single output neuron, and you'll have no more guarantee about what that output value might be. A more typical architecture would be something 2 neurons -> 3 neurons (sigmoid) -> 4 neurons (softmax) and now you'll have the probability that your input value falls into one of four classes.

Related

Labels -1, 0 and 1 for classification in Tensorflow

I am trying to write a model that outputs a vector of length N consisting of labels -1,0 and 1. Each of the labels depicts one of three decisions for the system participants (wireless devices). So the vector depicts a system state that is then passed on to an optimization problem in the next step. Due to the fix problem formulation that is awaiting the output vector a selection of 0,1 and 2 instead is not possible.
After coming across this tanh function to supply the -1,0 and 1 values:
1.5 * backend.tanh(alpha * x) + 0.5 * (backend.tanh(-(3 / alpha) * x)) from here, I was wondering how exactly this output layer and the penultimate layer can be built to suply this vector of labels {-1,0,1}. I tried using the above function in the output layer in a simple Iris classificator. But this resulted in terrible accuracy compared to the one achieved with 0,1,2 and softmax output layer.
Thanks in advance,
with kind regards,
Yuka

It doesn't seem like the outputs are actually "numerically related", for lack of a better term. Meaning, the labels could just as well be "left", "right", "up". So I think your best bet is to have 3 output nodes in the final layer, with softmax activation function, with each of the three nodes representing each of the three labels, using a Cross entropy loss function.
If your training data currently has the target as -1/0/1, you should one-hot encode it so that each target is a vector of length 3. So label 0 might be [0,1,0]

Autoencoder for Tabular Data with Discrete Values

I want to use an autoencoder for dimension reduction in Keras. The input is a table with discrete values 0,1,2,3,4 (each of these numbers show a category) in the columns. Each subject has a label 0/1 to show sick/healthy. Now I have two questions:
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
I don't know if this kind of input variables need normalization (and if the answer is yes, how?)

Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
The activation in the last layer should be sigmoid and use binary_crossentropy loss function for training.
I don't know if this kind of input variables need normalization (and if the answer is yes, how?)
It depends on the nature of discrete values you mentioned. As you know, inputs to a neural network represents the "intensity" of each neurons; higher values mean the neuron being more intensive/active. So, categorical values as input to a NN only makes sense if they map to a continuous range. For example if excellent=3, good=2, bad=1, terrible=0, it's okay to feed these values to a NN because it makes sense to calculate f(wx+b) (intensity of the neuron) as a value of 1.5 means somewhere between bad and good.
However if the categorical values are pure nomial values without any relationship between them (for example: apple=1, orange=2, banana=3), it really doen't make sense to calculate the f(wx+b). In this case what does value 1.5 mean? For this type of data as input to a NN you should convert them to a binary encoding. For example if you have only 3 fruits you can encode this way:
apple = [1, 0, 0]
orange = [0, 1, 0]
banana = [0, 0, 1]
For this binary conversion, Keras has an utility function: to_categorical.

How can I predict the expected value and the variance simultaneously with a neural network?

I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for Tensorflow, but I'm also interested in general answers. I'm also not quite clear how I could write something like in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)

You can use dropout for that. With a dropout layer you can make several different predictions based on different settings of which nodes dropped out. Then you can simply count the outcomes and interpret the result as a measure for uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.

Since I've found nothing simple to implement, I wrote something myself, that models that explicitly: here is a custom loss function that tries to predict mean and variance. It seems to work but I'm not quite sure how well that works out in practice, and I'd appreciate feedback. This is my loss function:
def meanAndVariance(y_true: tf.Tensor , y_pred: tf.Tensor) -> tf.Tensor :
"""Loss function that has the values of the last axis in y_true
approximate the mean and variance of each value in the last axis of y_pred."""
y_pred = tf.convert_to_tensor(y_pred)
y_true = math_ops.cast(y_true, y_pred.dtype)
mean = y_pred[..., 0::2]
variance = y_pred[..., 1::2]
res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
return K.mean(res, axis=-1)
The output dimension is twice the label dimension - mean and variance of each value in the label. The loss function consists of two parts: a mean squared error that has the mean approximate the mean of the label value, and the variance that approximates the difference of the value from the predicted mean.

When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also checkout our recent work on providing a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow ur idea. Treat the activations as random variables and then propagate mean and variance using error propagation to the output layer. Consequently, we obtain two outputs - the mean and the variance.

Custom loss function implementation issue in keras

I am implementing a custom loss function in keras. The output of the model is 10 dimensional softmax layer. To calculate loss: first I need to find the index of y firing 1 and then subtract that value with true value. I'm doing the following:
from keras import backend as K
def diff_loss(y_true,y_pred):
# find the indices of neuron firing 1
true_ind=K.tf.argmax(y_true,axis=0)
pred_ind=K.tf.argmax(y_pred,axis=0)
# cast it to float32
x=K.tf.cast(true_ind,K.tf.float32)
y=K.tf.cast(pred_ind,K.tf.float32)
return K.abs(x-y)
but it gives error "raise ValueError("None values not supported.")
ValueError: None values not supported."
What's the problem here?

This happens because your function is not differentiable. It's made of constants.
There is simply no solution for this if you want argmax as result.
An approach to test
Since you're using "softmax", that means that only one class is correct (you don't have two classes at the same time).
And since you want index differences, maybe you could work with a single continuous result (continuous values are differentiable).
Work with only one output ranging from -0.5 to 9.5, and take the classes by rounding the result.
That way, you can have the last layer with only one unit:
lastLayer = Dense(1,activation = 'sigmoid', ....) #or another kind if it's not dense
And change the range with a lambda layer:
lambdaLayer = Lambda(lambda x: 10*x - 0.5)
Now your loss can be a simple 'mae' (mean absolute error).
The downside of this attempt is that the 'sigmoid' activation is not evenly distributed between the classes. Some classes will be more probable than others. But since it's important to have a limit, it seems at first the best idea.
This will only work if you classes follow a logical increasing sequence. (I guess they do, otherwise you'd not be trying that kind of loss, right?)

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:
tf.nn.softmax(logits, name=None)
If logits is just a generic Tensor input, why is it named logits?
Secondly, what is the difference between the following two methods?
tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)
I know what tf.nn.softmax does, but not the other. An example would be really helpful.

The softmax+logits simply means that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, the sum of the inputs may not equal 1, that the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.
tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that sum(input) = 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of output of a softmax is the same as the input:
a = tf.constant(np.array([[.1, .3, .5, .9]]))
print s.run(tf.nn.softmax(a))
[[ 0.16838508 0.205666 0.25120102 0.37474789]]
See this answer for more about why softmax is used extensively in DNNs.
tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss after applying the softmax function, but it does it all together in a more mathematically careful way. It's similar to the result of:
sm = tf.nn.softmax(x)
ce = cross_entropy(sm)
The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2,1] (the first dimension is treated as the batch).
If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.

Short version:
Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x +b) and y_true contains one-hot encoded true labels.
y_hat = ... # Predicted label, e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded
If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.
Additionally, the total cross-entropy loss computed in this manner:
y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))
is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():
total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
Long version:
In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
import tensorflow as tf
import numpy as np
sess = tf.Session()
# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5, 1.5, 0.1],
# [ 2.2, 1.3, 1.7]])
Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863 , 0.61939586, 0.15274114],
# [ 0.49674623, 0.20196195, 0.30129182]])
It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
Pr(Class 1) Pr(Class 2) Pr(Class 3)
,--------------------------------------
Training instance 1 | 0.227863 | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182
So now we have class probabilities for each training instance, where we can take the argmax() of each row to generate a final classification. From above, we may generate that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0., 1., 0.],
# [ 0., 0., 1.]])
Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 , 1.19967598])
What we really want is the total loss over all the training instances. So we can compute:
total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944
Using softmax_cross_entropy_with_logits()
We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.
loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true)
sess.run(loss_per_instance_2)
# array([ 0.4790107 , 1.19967598])
total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_hat, y_true))
sess.run(total_loss_2)
# 0.83934333897877922
Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().

tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.
tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.
The logits are the unnormalized log probabilities output the model (the values output before the softmax normalization is applied to them).

Mathematical motivation for term
When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.
A common choice is a sigmoid function.1 In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a softmax).2
If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds) a.k.a. logit:
That is why the arguments to softmax is called logits in Tensorflow - because under the assumption that softmax is the final layer in the model, and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit:
Generalised term
In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.
While it has nice properties such as being easily diferentiable, and the aforementioned probabilistic interpretation, it is somewhat arbitrary.
softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.

Above answers have enough description for the asked question.
Adding to that, Tensorflow has optimised the operation of applying the activation function then calculating cost using its own activation followed by cost functions. Hence it is a good practice to use: tf.nn.softmax_cross_entropy() over tf.nn.softmax(); tf.nn.cross_entropy()
You can find prominent difference between them in a resource intensive model.

Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about Logits and the related Functions.
All those functions, when used in Tensorflow 1.x will work fine, but if you migrate your code from 1.x (1.14, 1.15, etc) to 2.x (2.0, 2.1, etc..), using those functions result in error.
Hence, specifying the 2.0 Compatible Calls for all the functions, we discussed above, if we migrate from 1.x to 2.x, for the benefit of the community.
Functions in 1.x:
tf.nn.softmax
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits
Respective Functions when Migrated from 1.x to 2.x:
tf.compat.v2.nn.softmax
tf.compat.v2.nn.softmax_cross_entropy_with_logits
tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits
For more information about migration from 1.x to 2.x, please refer this Migration Guide.

One more thing that I would definitely like to highlight as logit is just a raw output, generally the output of last layer. This can be a negative value as well. If we use it as it's for "cross entropy" evaluation as mentioned below:
-tf.reduce_sum(y_true * tf.log(logits))
then it wont work. As log of -ve is not defined.
So using o softmax activation, will overcome this problem.
This is my understanding, please correct me if Im wrong.

Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes the outputs of a neural network so that they are all between 0 and 1 and sum to 1. Softmax_cross_entropy_with_logits is a loss function that takes in the outputs of a neural network (after they have been squashed by softmax) and the true labels for those outputs, and returns a loss value.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use Softmax Activation function within a Neural Network - python

Related

Labels -1, 0 and 1 for classification in Tensorflow

Autoencoder for Tabular Data with Discrete Values

How can I predict the expected value and the variance simultaneously with a neural network?

Custom loss function implementation issue in keras

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

Categories

Resources