What does predict (Keras) return? - python

I know what it returns in general, but I don't understand the following example. It gives me this output: [0.238 0.762] in a model which has only binary outputs [0, 1].
I know these are the probabilities of each class for the given input, but which value corresponds to which class? Is the order [0, 1] or [1, 0]?

predict returns the neural network's outputs at the last layer. These are not necessarily probabilities; that depends on what you used in your network architecture. The simple answer is that you can run
model.predict(x) > 0.5
That should work in most cases. The NN will optimize toward the best solution, but all of the values inside it are continuous, so unless your problem is very easily separable you will rarely get an output that is fully binary.
To answer your question: unless the model was trained strangely, [0.238 0.762] most likely means [0, 1].
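As a minimal sketch of the two readings above, assuming the array from the question is what model.predict returned for a single sample (the array is typed in by hand here, so no Keras model is needed):

```python
import numpy as np

# Hypothetical result of model.predict(x): one sample, two class scores.
probs = np.array([[0.238, 0.762]])

# The index of the largest value in each row is the predicted class.
predicted_class = np.argmax(probs, axis=1)

# Thresholding every entry at 0.5 gives the [0, 1] style answer.
one_hot = (probs > 0.5).astype(int)

print(predicted_class)  # [1]
print(one_hot)          # [[0 1]]
```

The column index is the class index, so the second value (0.762) belongs to class 1.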

For binary classification, the first column is the probability of class 0 and the second is the probability of class 1. You can check this in the Keras code base, L257-L260 of keras.wrappers.scikit_learn.KerasClassifier:
# check if binary classification
if probs.shape[1] == 1:
    # first column is probability of class 0 and second is of class 1
    probs = np.hstack([1 - probs, probs])
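To see what that snippet does in isolation, here is a small sketch with hand-written values standing in for the output of a single sigmoid unit (the numbers are assumptions for illustration, not taken from a real model):

```python
import numpy as np

# Hypothetical sigmoid outputs, shape (n_samples, 1): P(class 1) per sample.
probs = np.array([[0.762], [0.10]])

# Same check as in the wrapper: a single column means binary classification.
if probs.shape[1] == 1:
    # Column 0 becomes P(class 0) = 1 - p, column 1 stays P(class 1) = p.
    probs = np.hstack([1 - probs, probs])

print(probs.shape)  # (2, 2)
```

After the stacking, each row sums to 1 and the column index matches the class index.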

Related

Neural Network - convert model output to the predicted target class

I am building a neural network from scratch in python.
In the dataset I am using for testing, the features are all numeric (57 features) and the target variable is categorical (10 classes, already converted it to numeric from 0-9, but can use other encoding).
Everything seems to be working, except that I am quite stuck on how to compare my model output with the y_true value to compute the error. So I have 10 classes for the target variable and what I get as output is an array of 10-elements for each observation, instead of a unique value/classification for each sample.
Can someone give me a simple way to convert my output to a single y_predicted value that's comparable with the y_true?
I am trying not to use any libraries except for Numpy and pandas, so using Keras SparseCategoricalCrossentropy() is not an option.
So neural networks don't return class labels as we know them. They return the probabilities of the input belonging to each of the classes. With a softmax output layer, these probabilities sum to 1.
Suppose you have 4 classes: A, B, C and D.
If the true label of an input is B, the NN output will not look like this:
output = [0, 1, 0, 0]
It will look like this:
output = [0.1, 0.6, 0.2, 0.1]
Where the chance of input belonging to class A is 10%, class B is 60%, class C is 20% and class D is 10%.
You can easily make sure that this array sums up to 1 by adding the array elements and dividing each element by the sum. Then, you can use numpy argmax to get the index of the highest probability.
If you don't need the probabilities, then you can skip converting the output such that the sum is 1 and directly apply numpy argmax to get the index of the class with the highest probability.
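A minimal NumPy-only sketch of those two steps, assuming the network outputs are non-negative scores (the helper name to_labels and the sample values are made up for illustration):

```python
import numpy as np

def to_labels(outputs):
    """Normalize each row to sum to 1, then pick the most likely class."""
    outputs = np.asarray(outputs, dtype=float)
    # Divide every row by its own sum so the row reads as probabilities.
    probs = outputs / outputs.sum(axis=1, keepdims=True)
    # argmax along the class axis gives one integer label per sample.
    return probs, np.argmax(probs, axis=1)

# Two hypothetical samples with 4 class scores each.
raw = [[1, 6, 2, 1], [3, 1, 1, 5]]
probs, y_pred = to_labels(raw)

y_true = np.array([1, 3])
accuracy = np.mean(y_pred == y_true)

print(y_pred)  # [1 3]
```

The resulting y_pred has the same shape and encoding as a 0-based y_true, so the two can be compared element by element.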

Pytorch - (Categorical) Cross Entropy Loss using one hot encoding and softmax

I'm looking for a cross entropy loss function in Pytorch that is like the CategoricalCrossEntropyLoss in Tensorflow.
My labels are one hot encoded and the predictions are the outputs of a softmax layer. For example (every sample belongs to one class):
targets = [0, 0, 1]
predictions = [0.1, 0.2, 0.7]
I want to compute the (categorical) cross entropy on the softmax values and do not take the max values of the predictions as a label and then calculate the cross entropy. Unfortunately, I did not find an appropriate solution since Pytorch's CrossEntropyLoss is not what I want and its BCELoss is also not exactly what I need (isn't it?).
Does anyone know which loss function to use in Pytorch or how to deal with it?
Many thanks in advance!
I thought Tensorflow's CategoricalCrossEntropyLoss was equivalent to PyTorch's CrossEntropyLoss, but it seems not. The former takes one-hot encoded targets, while the latter takes class labels. It seems, however, that the difference is:
torch.nn.CrossEntropyLoss is a combination of torch.nn.LogSoftmax and torch.nn.NLLLoss():
tf.keras.losses.CategoricalCrossEntropyLoss is something like:
Your predictions have already been through a softmax. So only the negative log-likelihood needs to be applied. Based on what was discussed here, you could try this:
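import torch.nn as nn
import torch.nn.functional as F

class CategoricalCrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, y_hat, y):
        # y_hat: softmax probabilities, y: one-hot encoded targets
        return F.nll_loss(y_hat.log(), y.argmax(dim=1))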
Above, the one-hot encoded target vector is converted to a class label with torch.Tensor.argmax.
If that's correct, why not just use torch.nn.CrossEntropyLoss in the first place? You would just have to remove the softmax on your model's last layer and convert your targets to labels.
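The arithmetic behind that loss can be checked without any framework. The sketch below is a NumPy illustration (not PyTorch code, and the function name is made up) using the targets and predictions from the question. It computes the cross entropy once from the one-hot targets and once from the argmax labels, which is the route the nll_loss version takes:

```python
import numpy as np

def categorical_cross_entropy(probs, targets, eps=1e-12):
    """Mean cross entropy between softmax outputs and one-hot targets."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

targets = np.array([[0, 0, 1]])
predictions = np.array([[0.1, 0.2, 0.7]])

# Route 1: sum over classes, weighted by the one-hot targets.
loss_one_hot = categorical_cross_entropy(predictions, targets)

# Route 2: convert targets to labels, then take -log of the picked probability.
labels = targets.argmax(axis=1)
picked = predictions[np.arange(len(labels)), labels]
loss_labels = float(np.mean(-np.log(picked)))
```

Both routes give -log(0.7) here, because a one-hot target zeroes out every term except the one for the true class.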

how to interpret a probability predictions of a deep learning model that is an output of a sigmoid activation of last layer?

I have trained a model for a binary classification task (pos. vs. neg.) and have it saved as a .h5 file. I also have external data (which was never used in training or validation). There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds , axis=1)
The above code is supposed to calculate probability (preds) and class labels (0 or 1) if it were trained with softmax as the last output layer. But, preds is only a single number between [0;1] and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC with the area being around 0.75.
I can see the probabilities of those 20 samples mostly (17) lie between 0 - 0.15, the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 of the 20 samples used for testing the model belong to the positive class, and the other 10 belong to the negative class. All 10 that belong to the pos. class have a very low probability (0 - 0.15). 7 out of 10 negative samples have the same low probability; only 3 are higher (0.74, 0.51 and 0.79).
The question: Why is the model predicting the samples with such a low probability even though its AUC was fairly high?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with one entry per image to predict. We can retrieve the predicted class simply by checking the probability score: if it's above 0.5 (this is common practice, but you can change the threshold according to your needs), the image belongs to class 1; otherwise it belongs to class 0.
preds = model.predict(img) # (n_images, 1)
y_classes = ((preds > 0.5)+0).ravel() # (n_images,)
In the case of sigmoid, your last output layer must be Dense(1, activation='sigmoid').
In the case of softmax (as you have just done), the predicted classes are retrieved using argmax:
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds , axis=1) # (n_images,)
In the case of softmax, your last output layer must be Dense(n_classes, activation='softmax').
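The reason y_classes was always 0 in the question can be reproduced with a few hand-written values (assumed here, standing in for the output of a single sigmoid unit): with only one column, argmax over axis 1 has nothing to choose from.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_images, 1).
preds = np.array([[0.05], [0.74], [0.51]])

# argmax over a single column can only ever return index 0.
argmax_classes = np.argmax(preds, axis=1)

# Thresholding is the right way to read a single sigmoid output.
threshold_classes = (preds > 0.5).astype(int).ravel()

print(argmax_classes)     # [0 0 0]
print(threshold_classes)  # [0 1 1]
```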
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading: it can cause us sometimes to overestimate and sometimes to underestimate the actual performance of a model. Average precision is more expressive for getting a sense of how the model is doing, because it is more sensitive when distinguishing between a good and a very good model. Moreover, it is directly linked to precision, an indicator that is easy for humans to understand. Here is a good reference on the topic: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
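Loosely speaking, the softmax function takes a sequence of per-class scores, aggregates them, and reports the ratio between a specific class's score and the aggregate over all classes. (Strictly, softmax exponentiates the raw scores before normalizing them, rather than applying a sigmoid to each one.)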
For example: if I'm using a model to predict whether an image is an image of a banana, apple or grape, and my model recognizes that a certain image is 0.75 banana, 0.20 apple and 0.15 grape (Each probability is generated with a sigmoid function), my softmax layer will make this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818 && apple: 0.20 / 1.1 = 0.1818 && grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to our softmax layer. Yet, in order to make this classification, it first used a series of sigmoid functions.
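The arithmetic in the banana example can be reproduced in a few lines. The sketch below also includes the standard softmax definition for comparison, since softmax as usually defined exponentiates raw scores (logits) before normalizing. The logits used are made-up values, not derived from the example:

```python
import numpy as np

# Scores from the example: banana, apple, grape.
scores = np.array([0.75, 0.20, 0.15])

# Plain ratio normalization, as computed in the example above.
ratio = scores / scores.sum()
print(np.round(ratio, 4))  # [0.6818 0.1818 0.1364]

def softmax(logits):
    """Standard softmax: exponentiate, then normalize to sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores (logits) for the same three classes.
logits = np.array([2.0, 0.5, 0.1])
probs = softmax(logits)
```

In both cases the outputs sum to 1 and the largest input keeps the largest share, which is why the picture is still classified as a banana.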
To get to the point: the output of a sigmoid function should be interpreted much like the output of a softmax layer, but while a softmax layer gives you a comparison of one class against the others, a sigmoid function simply tells you how likely it is that this piece of information belongs to the positive class.
In order to make the final call and decide whether a certain item belongs to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of interpreting your output. If you'd like to maximize the precision of your model, pick a high threshold; if you'd like to maximize its recall, pick a lower threshold.
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.
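The effect of the threshold on precision and recall can be sketched with a small helper. Both the function name and the labels and probabilities below are invented for illustration; they are not the 20 samples from the question:

```python
import numpy as np

def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probs) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and sigmoid outputs.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]

low = precision_recall(y_true, probs, 0.3)   # (0.75, 1.0)
high = precision_recall(y_true, probs, 0.8)  # precision 1.0, recall 1/3
```

With the low threshold every positive is found but one negative slips in; with the high threshold nothing is misflagged, but two positives are missed.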

Autoencoder for Tabular Data with Discrete Values

I want to use an autoencoder for dimension reduction in Keras. The input is a table with discrete values 0,1,2,3,4 (each of these numbers show a category) in the columns. Each subject has a label 0/1 to show sick/healthy. Now I have two questions:
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
I don't know if this kind of input variables need normalization (and if the answer is yes, how?)
Which activation function should I use in the last layer? Shall I use a combination of sigmoid and ReLU?
The activation in the last layer should be sigmoid, and you should use the binary_crossentropy loss function for training.
I don't know if this kind of input variables need normalization (and if the answer is yes, how?)
It depends on the nature of the discrete values you mentioned. As you know, the inputs to a neural network represent the "intensity" of each neuron; higher values mean the neuron is more intense/active. So categorical values as input to a NN only make sense if they map to a continuous range. For example, if excellent=3, good=2, bad=1, terrible=0, it's okay to feed these values to a NN, because it makes sense to calculate f(wx+b) (the intensity of the neuron): a value of 1.5 means somewhere between bad and good.
However, if the categorical values are purely nominal values without any relationship between them (for example: apple=1, orange=2, banana=3), it really doesn't make sense to calculate f(wx+b). In this case, what does the value 1.5 mean? For this type of data as input to a NN, you should convert the values to a binary encoding. For example, if you have only 3 fruits you can encode them this way:
apple = [1, 0, 0]
orange = [0, 1, 0]
banana = [0, 0, 1]
For this binary conversion, Keras has a utility function: to_categorical.
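For readers who want to see the encoding itself, here is a NumPy-only sketch of the same idea (this is not the Keras implementation, and the helper name one_hot is made up). It assumes the category ids start at 0, so the fruits are numbered 0, 1, 2 rather than 1, 2, 3:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer category ids (0-based) into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for category i.
    return np.eye(num_classes, dtype=int)[labels]

# Hypothetical 0-based ids for the three fruits.
fruit_ids = {"apple": 0, "orange": 1, "banana": 2}

encoded = one_hot([fruit_ids["apple"], fruit_ids["banana"]], num_classes=3)
print(encoded.tolist())  # [[1, 0, 0], [0, 0, 1]]
```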

TensorFlow: Sample Integers from Gumbel Softmax

I am implementing a program to sample integers from a categorical distribution, where each integer is associated with a probability. I need to ensure that this program is differentiable, so that back propagation can be applied. I found tf.contrib.distributions.RelaxedOneHotCategorical which is very close to what I am trying to achieve.
However, the sample method of this class returns a one-hot vector, instead of an integer. How to write a program that is both differentiable and returns an integer/scalar instead of a vector?
The reason that RelaxedOneHotCategorical is actually differentiable is connected to the fact that it returns a softmax vector of floats instead of the argmax int index. If all you want is the index of the maximal element, you might as well use Categorical.
You can do a dot product of the relaxed one hot vector with a vector of [1 2 3 4 ... n]. The result is going to give you the desired scalar.
For instance if your one hot vector is [0 0 0 1], then dot([0 0 0 1],[1 2 3 4]) will give you 4 which is what you are looking for.
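A quick NumPy sketch of that dot product, using the hard one-hot vector from the example and a made-up relaxed vector of the kind a Gumbel-Softmax sample might look like (plain NumPy is used only to show the arithmetic; it does not track gradients):

```python
import numpy as np

indices = np.array([1, 2, 3, 4])

# Exact one-hot vector: the dot product recovers the integer.
hard = np.array([0, 0, 0, 1])
hard_value = np.dot(hard, indices)

# Hypothetical relaxed sample: mostly class 4, with some mass elsewhere.
soft = np.array([0.05, 0.05, 0.10, 0.80])
soft_value = np.dot(soft, indices)

print(hard_value)  # 4
```

For the relaxed vector the result is a weighted average of the indices (about 3.65 here) rather than an exact integer, and it approaches the integer as the sample approaches a one-hot vector.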
You can't get what you want in a differentiable manner because argmax isn't differentiable, which is why the Gumbel-Softmax distribution was created in the first place. This allows you, for instance, to use the outputs of a language model as inputs to a discriminator in a generative adversarial network because the activation approaches a one-hot vector as the temperature changes.
If you simply need to retrieve the maximal element at inference or testing time, you can use tf.math.argmax. But there's no way to do that in a differentiable manner.
