How does a classifier predict a class? (for tensorflow/keras) - python

I want to implement the model myself, so I need to know how it classifies the data. I built a 12-class classifier and it predicts fine. But the last conv layer just outputs 12 floating-point values, and I don't know how those turn into the right class.
Can someone explain this to me? Does it depend on some threshold, or does it choose the max value, or something else? Thanks!

According to the documentation of SparseCategoricalAccuracy, it's equivalent to this computation:
acc = np.dot(sample_weight, np.equal(y_true, np.argmax(y_pred, axis=1)))
This means that it calculates the frequency with which the maximum value per row of y_pred matches y_true. For instance:
m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state([[2], [1]], [[0.1, 0.6, 0.3], # max at 1
[0.05, 0.95, 0]]) # max at 1
m.result().numpy()
0.5
Because [2] != [1] and [1] == [1], the predictions match the labels 0.5 of the time.
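To make the computation concrete, here is a minimal sketch of the same metric in plain Python, without Keras. The function name `sparse_categorical_accuracy` is only a label chosen for this illustration, and sample weights are left out for brevity.

```python
def sparse_categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose highest-scoring index equals the true label."""
    hits = 0
    for label, scores in zip(y_true, y_pred):
        # argmax: the index of the largest score in this row
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        hits += int(predicted == label)
    return hits / len(y_true)


y_true = [2, 1]
y_pred = [[0.1, 0.6, 0.3],    # max at index 1, label is 2 -> miss
          [0.05, 0.95, 0.0]]  # max at index 1, label is 1 -> hit
accuracy = sparse_categorical_accuracy(y_true, y_pred)
print(accuracy)  # 0.5
```

So no threshold is involved for a multi-class output: the predicted class is simply the index of the largest value.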

You need to add a Flatten layer to your model, followed by a Dense layer. The Dense layer should have 12 nodes and use the softmax activation, as shown below. Your model will now output a list of 12 probability values for each image.
flatten=tf.keras.layers.Flatten()(last_conv_layer)
output = Dense(12, activation='softmax')(flatten)
#after you train you can evaluate your model on your test set using model.evaluate()
#to make Predictions use model.predict()
predictions=model.predict(.....
# you can get the index of the predicted class for each image you predict with
for p in predictions:
    predicted_index = np.argmax(p)
    print(predicted_index)
Documentation for model.evaluate and model.predict is here. Don't forget to recompile your model after you add the two layers.
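For anyone implementing this step by hand, the sketch below shows what happens to 12 raw output values: softmax turns them into probabilities, and argmax picks the class. The 12 numbers are made up for illustration, and `softmax` and `argmax` are plain-Python helpers written for this example, not Keras functions.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                              # shift for numerical safety
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical raw outputs of the last layer for one image (12 classes).
raw_outputs = [0.2, -1.3, 0.5, 2.1, 0.0, -0.7, 1.4, 0.9, -2.0, 0.3, 4.8, 1.1]

probs = softmax(raw_outputs)
predicted_class = argmax(probs)
print(predicted_class)  # 10
```

Because softmax preserves the ordering of its inputs, taking the argmax of the raw outputs gives the same class. Softmax matters for training and for reading the outputs as probabilities, not for the final decision.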

Related

Multiple outcome values for simple neural network. What activate function to use

Hi, I'm trying to build a simple neural network with tensorflow, where I give the model the training_data, which contains the standard values, and the target_data, which is the result I want it to produce when the input is near one of those values.
For example, if I give y_test a value of 3.5, the model would predict a number close to 4, so the condition would say it was a lightsmoker. I searched a bit for activation functions and learned that I can't use sigmoid for what I want to do. I'm quite new to this. What I've done so far is by trial and error.
import random
import tensorflow as tf
import numpy as np
training_data=[]
for i in range(0,5):
    training_data.append([random.uniform(0,0.2944)])
for i in range(0,5):
    training_data.append([random.uniform(0.2944,1.7394)])
for i in range(0,5):
    training_data.append([random.uniform(1.7394,3.2394)])
for i in range(0,5):
    training_data.append([random.uniform(3.2394,6)])
target_data=[]
for i in range(0,5):
    target_data.append([1])
for i in range(0,5):
    target_data.append([2])
for i in range(0,5):
    target_data.append([3])
for i in range(0,5):
    target_data.append([4])
y_test= np.array([100])
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(len(target_data),input_dim=1,activation='softmax'))
model.add(tf.keras.layers.Dense(1,activation='relu'))
model.compile( loss='mean_squared_error',
optimizer='adam',
metrics=['accuracy'])
training_data = np.asarray(training_data)
target_data = np.asarray(target_data)
model.fit(training_data, target_data, epochs=50, verbose=0)
target_pred= model.predict(y_test)
target_pred=float(target_pred)
print("X=%s, Predicted=%s" % (y_test, target_pred))
if( 0<= target_pred <= 1.5):
    print("\nNon-Smoker")
elif(1.5<= target_pred <2.5):
    print("\nPassive Smoker")
elif(2.5<= target_pred <3.5 ):
    print("Lightsmoker")
else:
    print("Smoker\n")
Here is a helpful guide to using activation functions in the final layer as well as corresponding losses for different type of problems.
In your case, I am assuming you are working with a regression task with arbitrary values (any float value as output, not restricted between 0 to 1 or -1 to 1). So, skip the activation function and keep mse or mean_squared_error as your loss function.
EDIT:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(3,input_shape=(1,),activation='relu'))
model.add(tf.keras.layers.Dense(1))
You are defining your problem as a regression problem, where the result of model.predict is a continuous value. In that situation the last layer in your model is a linear layer with no activation function, and mse is a fine choice of loss.
Alternatively, you could define your problem as a classification problem with, say, 3 classes: Non_Smoker, Passive_Smoker and Light_Smoker. In that case your target data in training is not a number in the numerical sense but an integer that indicates which class the training sample represents. For example, you could have Non_Smoker with the label 0, Passive_Smoker with the label 1 and Light_Smoker with the label 2. The last layer in your model would then use a softmax activation function. In model.compile your loss would be sparse_categorical_crossentropy, because your labels are integers. If you one-hot encode your labels instead, for example Non_Smoker coded as 100, Passive_Smoker as 010 and Light_Smoker as 001, then your loss function would be categorical_crossentropy.
When you run model.predict on a test sample, it will produce a list containing 3 probabilities. The first is the probability for class 0 (Non_Smoker), the second is the probability for class 1 (Passive_Smoker) and the third is the probability for class 2 (Light_Smoker). You then use np.argmax to find which index has the highest probability, and that index is the model's prediction.
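As a small illustration of the labelling scheme described above, the sketch below shows integer labels, their one-hot form, and how an argmax over predicted probabilities maps back to a class name. The probabilities are invented for the example, and `one_hot` and `predicted_class` are hypothetical helper names, not Keras functions.

```python
# Position in this list = integer label used with sparse_categorical_crossentropy.
class_names = ["Non_Smoker", "Passive_Smoker", "Light_Smoker"]


def one_hot(label, num_classes):
    """One-hot form of an integer label, as used with categorical_crossentropy."""
    return [1 if i == label else 0 for i in range(num_classes)]


def predicted_class(probabilities):
    """Name of the class with the highest predicted probability (argmax)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index]


# Made-up softmax output for one test sample.
sample_probabilities = [0.10, 0.70, 0.20]

print(one_hot(2, 3))                          # [0, 0, 1]
print(predicted_class(sample_probabilities))  # Passive_Smoker
```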

how to interpret a probability predictions of a deep learning model that is an output of a sigmoid activation of last layer?

I have trained a binary classification task (pos. vs. neg.) and have a .h5 model. I also have external data (which was never used in training or in validation). There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds , axis=1)
The above code is supposed to calculate probabilities (preds) and class labels (0 or 1) if the model were trained with softmax as the last output layer. But preds is only a single number in the range [0, 1], and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC with the area being around 0.75.
I can see the probabilities of those 20 samples mostly (17) lie between 0 - 0.15, the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 of the 20 samples used for testing the model belong to the positive class, and the other 10 belong to the negative class. All 10 that belong to the pos. class have a very low probability (0 - 0.15). 7 out of the 10 negative samples have the same low probability, and only 3 are higher (0.74, 0.51 and 0.79).
The question: Why is the model predicting the samples with such a low probability even though its AUC was quite higher?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with one entry per image to predict. We can retrieve the predicted class simply by checking the probability score: if it's above 0.5 (this is common practice, but you can change it according to your needs) the image belongs to class 1, otherwise it belongs to class 0.
preds = model.predict(img) # (n_images, 1)
y_classes = (preds > 0.5).astype(int).ravel() # (n_images,)
In the case of sigmoid, your last output layer must be Dense(1, activation='sigmoid').
In the case of softmax (as you have just done), the predicted classes are retrieved using argmax:
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds , axis=1) # (n_images,)
In the case of softmax, your last output layer must be Dense(n_classes, activation='softmax').
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading and can cause us sometimes to overestimate and sometimes to underestimate the actual performance of a model. The behavior of Average-Precision is more expressive in giving a flavor of how the model is doing, because it is more sensitive in distinguishing between a good and a very good model. Moreover, it is directly linked to precision, an indicator which is human-understandable. Here is a great reference on the topic which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
The softmax function generalizes the sigmoid to more than two classes: it exponentiates each class score and then reports the ratio between a specific class's exponentiated score and the sum over all classes.
For example: if I'm using a model to predict whether an image is an image of a banana, apple or grape, and the exponentiated scores for a certain image are 0.75 for banana, 0.20 for apple and 0.15 for grape, my softmax layer will make this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818, apple: 0.20 / 1.1 = 0.1818, grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana, because banana has the largest share after normalization.
To get to the point: the interpretation of a sigmoid output should be similar to the one you'd make with a softmax layer, but while a softmax layer gives you a comparison between one class and another, a sigmoid function simply tells you how likely it is that this piece of information belongs to the positive class.
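One way to see how the two functions relate is to check numerically that a two-class softmax over the scores [z, 0] gives the same probability as a sigmoid applied to z. The sketch below uses plain-Python helpers written only for this illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Exponentiate each score and normalize so the results sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


for z in (-2.0, 0.0, 3.0):
    positive_from_softmax = softmax([z, 0.0])[0]
    # The two columns agree up to floating-point rounding.
    print(z, sigmoid(z), positive_from_softmax)
```

This is why a binary classifier can use a single sigmoid unit instead of a two-unit softmax layer: the second probability is just one minus the first.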
In order to make the final call and decide if a certain item does or doesn't belong to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of your output interpretation. If you'd like to max the precision of your model, you will pick a high threshold, but if you'd like to max the recall of your model you can definitely pick a lower threshold.
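The trade-off described above can be seen on a tiny example. The probabilities and labels below are invented for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(probabilities, labels, threshold):
    """Precision and recall when predicting class 1 for scores above threshold."""
    predicted = [int(p > threshold) for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sigmoid outputs and their true labels.
probabilities = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95]
labels        = [0,    0,    1,    0,    1,    0,    1,    1]

for threshold in (0.25, 0.5, 0.7):
    print(threshold, precision_recall(probabilities, labels, threshold))
```

On this toy data a threshold of 0.7 gives perfect precision but only half the recall, while 0.25 catches every positive at the cost of more false positives.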
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.

Tensorflow 2.0 - do these model predictions represent probabilities?

I have a very simple Tensorflow 2 Keras model to do penalized logistic regression on some data. I was hoping to get the probabilities of each class, instead of just the predicted values of [0 or 1].
I think I got what I wanted, but just wanted to make sure that these numbers are what I think they are. I used the model.predict_on_batch() function from Tensorflow.keras, but the documentation just says that this provides a numpy array of predictions. However I believe I am getting probabilities, but I was hoping someone could confirm.
The model code looks like this:
feature_layer = tf.keras.layers.DenseFeatures(features)
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(1, activation='sigmoid', kernel_regularizer=tf.keras.regularizers.l1(0.01))
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
predictions = model.predict_on_batch(validation_dataset)
print('Predictions for a single batch.')
print(predictions)
So the predictions I am getting look like:
Predictions for a single batch.
tf.Tensor(
[[0.10916319]
[0.14546806]
[0.13057315]
[0.11713684]
[0.16197902]
[0.19613355]
[0.1388464 ]
[0.14122346]
[0.26149303]
[0.12516734]
[0.1388464 ]
[0.14595506]
[0.14595506]]
Now, for predictions in a logistic regression I would expect an array of either 0 or 1, but I am getting floating point values. However, I am getting just a single value per example, when there is actually a probability that the example is a 0 and a probability that the example is a 1. So I would have imagined an array of 2 probabilities for each row or example. Of course, Probability(Y = 0) + Probability(Y = 1) = 1, so this might just be a concise representation.
So again, do the values in the array above represent the probability that the example has Y = 1, or something else?
The values represented here:
tf.Tensor(
[[0.10916319]
[0.14546806]
[0.13057315]
[0.11713684]
[0.16197902]
[0.19613355]
[0.1388464 ]
[0.14122346]
[0.26149303]
[0.12516734]
[0.1388464 ]
[0.14595506]
[0.14595506]]
are the predicted probabilities that each example belongs to the positive class (Y = 1), one value per row. Since you used sigmoid activation on your last layer, these will be in the range [0, 1].
Your model is very shallow (few layers), and thus these prediction probabilities are very close to one another. I suggest you add more layers.
Conclusion
To answer your question, these are probabilities, but only due to your activation function selection (sigmoid). If you used tanh activation these would be in the range [-1, 1].
Note that these probabilities are "binary" for each example due to the use of a single sigmoid output with binary_crossentropy loss: the first row means 10.92% that the example belongs to class 1 and 89.08% that it does not, and so on for the other examples. If you want an explicit probability per class, with the values summing to 1, use a softmax output layer with one unit per class together with categorical_crossentropy.
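If the two-column form is wanted, it can be derived from the single sigmoid output, since P(Y = 0) = 1 - P(Y = 1). A minimal sketch, using three of the values printed above and a 0.5 threshold chosen as an assumption for the example:

```python
# A few of the sigmoid outputs from the question, read as P(Y = 1).
p_positive = [0.10916319, 0.14546806, 0.26149303]

# Expand each value into [P(Y = 0), P(Y = 1)].
two_column = [[1.0 - p, p] for p in p_positive]

# Hard 0/1 predictions, assuming the usual 0.5 threshold.
predicted_labels = [int(p > 0.5) for p in p_positive]

for row, label in zip(two_column, predicted_labels):
    print(row, label)
```

All three examples are predicted as class 0 here, because none of the probabilities exceeds the threshold.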

How to get prediction when computing loss function in convolutional neural network (tensorflow)?

I built a convolutional neural network with tensorflow by following these steps:
https://www.tensorflow.org/tutorials/estimators/cnn
I want to compute the loss with my own loss function and therefore need to get the predicted probabilities of each class in each training step.
From the Tensorflow tutorial I know that I can get these probabilities with "tf.nn.softmax(logits)"; however, this returns a tensor and I don't know how to extract the actual probabilities from this tensor. Can anyone please tell me how I can get these probabilities, so I can compute my loss function?
This is how you compute the softmax and get the probabilities afterwards:
# Probabilities for each element in the batch for each class.
softmax = tf.nn.softmax(logits, axis=1)
# For each element in the batch return the index of the class that has the maximal probability
predictions = tf.argmax(softmax, axis=1)
However, please note that you don't need the predictions in order to compute the loss function; you need the actual probabilities. If you want to compute other metrics (such as accuracy, precision, recall, etc.), then you can use the predictions. The softmax Tensor contains the actual probabilities for each of your classes. For example, assuming that you have 2 elements in a batch and you are trying to predict one out of three classes, the softmax will give you the following:
# Logits with arbitrary example numbers
logits = np.array([[24, 23, 50], [50, 30, 32]], dtype=np.float32)
softmax = tf.nn.softmax(logits, axis=1)
# The softmax returns
# [[5.1090889e-12 1.8795289e-12 1.0000000e+00]
# [1.0000000e+00 2.0611537e-09 1.5229979e-08]]
# If we sum the probabilities for each element in the batch they should sum up to one
tf.reduce_sum(softmax, axis=1)
# [1. 1.]
Based on how you imagine your loss function to be this should be correct:
first_second = tf.nn.l2_loss(softmax[0] - softmax[1])
first_third = tf.nn.l2_loss(softmax[0] - softmax[2])
divide_and_add_m = tf.divide(first_second, first_third) + m
loss = tf.maximum(0.0, 1 - tf.reduce_sum(divide_and_add_m))

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

In the tensorflow API docs they use a keyword called logits. What is it? A lot of methods are written like:
tf.nn.softmax(logits, name=None)
If logits is just a generic Tensor input, why is it named logits?
Secondly, what is the difference between the following two methods?
tf.nn.softmax(logits, name=None)
tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)
I know what tf.nn.softmax does, but not the other. An example would be really helpful.
The softmax+logits simply means that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, that the sum of the inputs may not equal 1 and that the values are not probabilities (you might have an input of 5). Internally, it first applies softmax to the unscaled output, and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.
tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that the outputs sum to 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of the output of a softmax is the same as the input:
a = tf.constant(np.array([[.1, .3, .5, .9]]))
s = tf.Session()  # TF 1.x session, as in the rest of this answer
print(s.run(tf.nn.softmax(a)))
# [[ 0.16838508 0.205666 0.25120102 0.37474789]]
See this answer for more about why softmax is used extensively in DNNs.
tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss after applying the softmax function, but it does it all together in a more mathematically careful way. It's similar to the result of:
sm = tf.nn.softmax(x)
ce = cross_entropy(sm)
The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2] (the first dimension is treated as the batch, and the class dimension is reduced away).
If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
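To illustrate the kind of corner case meant here, the sketch below compares a naive "softmax, then log" computation with a version based on the log-sum-exp trick. This is one common way to stabilize the calculation, shown in plain Python. It is not claimed to be TensorFlow's exact implementation, and both function names are hypothetical.

```python
import math


def naive_cross_entropy(logits, true_index):
    """Softmax followed by -log of the true class probability."""
    exps = [math.exp(z) for z in logits]          # overflows for large logits
    return -math.log(exps[true_index] / sum(exps))


def stable_cross_entropy(logits, true_index):
    """Same quantity, computed as log(sum(exp(logits))) - logits[true_index]."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[true_index]


small_logits = [0.5, 1.5, 0.1]
large_logits = [1000.0, 0.0, -1000.0]

# For moderate logits both versions agree.
print(naive_cross_entropy(small_logits, 1), stable_cross_entropy(small_logits, 1))

# For extreme logits the naive version breaks, the stable one does not.
try:
    naive_cross_entropy(large_logits, 1)
    naive_failed = False
except (OverflowError, ValueError):
    naive_failed = True
print(naive_failed, stable_cross_entropy(large_logits, 1))
```

Subtracting the largest logit before exponentiating keeps every exponent at or below zero, so nothing overflows and no epsilon is needed inside the logarithm.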
Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.
Short version:
Suppose you have two tensors, where y_hat contains computed scores for each class (for example, from y = W*x +b) and y_true contains one-hot encoded true labels.
y_hat = ... # Predicted class scores (logits), e.g. y = tf.matmul(X, W) + b
y_true = ... # True label, one-hot encoded
If you interpret the scores in y_hat as unnormalized log probabilities, then they are logits.
Additionally, the total cross-entropy loss computed in this manner:
y_hat_softmax = tf.nn.softmax(y_hat)
total_loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), [1]))
is essentially equivalent to the total cross-entropy loss computed with the function softmax_cross_entropy_with_logits():
total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true))
Long version:
In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
import tensorflow as tf
import numpy as np
sess = tf.Session()
# Create example y_hat.
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],[2.2, 1.3, 1.7]]))
sess.run(y_hat)
# array([[ 0.5, 1.5, 0.1],
# [ 2.2, 1.3, 1.7]])
Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
y_hat_softmax = tf.nn.softmax(y_hat)
sess.run(y_hat_softmax)
# array([[ 0.227863 , 0.61939586, 0.15274114],
# [ 0.49674623, 0.20196195, 0.30129182]])
It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
                      Pr(Class 1)  Pr(Class 2)  Pr(Class 3)
                    ,--------------------------------------
Training instance 1 | 0.227863   | 0.61939586 | 0.15274114
Training instance 2 | 0.49674623 | 0.20196195 | 0.30129182
So now we have class probabilities for each training instance, and we can take the argmax() of each row to generate a final classification. From the table above, we would conclude that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],[0.0, 0.0, 1.0]]))
sess.run(y_true)
# array([[ 0., 1., 0.],
# [ 0., 0., 1.]])
Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
loss_per_instance_1 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])
sess.run(loss_per_instance_1)
# array([ 0.4790107 , 1.19967598])
What we really want is the total loss over all the training instances. So we can compute:
total_loss_1 = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
sess.run(total_loss_1)
# 0.83934333897877944
Using softmax_cross_entropy_with_logits()
We can instead compute the total cross entropy loss using the tf.nn.softmax_cross_entropy_with_logits() function, as shown below.
loss_per_instance_2 = tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true)
sess.run(loss_per_instance_2)
# array([ 0.4790107 , 1.19967598])
total_loss_2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true))
sess.run(total_loss_2)
# 0.83934333897877922
Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().
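For readers without a TensorFlow 1.x session at hand, the numbers in this walkthrough can be reproduced with the standard library alone. The sketch below mirrors the "softmax, then -sum(y_true * log(p))" formula on the same y_hat and y_true values. The helper names are chosen for this illustration.

```python
import math


def softmax(scores):
    """Normalize one row of class scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, probabilities):
    """-sum(y_true * log(p)) for one training instance."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probabilities))


y_hat = [[0.5, 1.5, 0.1], [2.2, 1.3, 1.7]]    # same scores as above
y_true = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # same one-hot labels as above

loss_per_instance = [cross_entropy(y, softmax(z)) for y, z in zip(y_true, y_hat)]
total_loss = sum(loss_per_instance) / len(loss_per_instance)

print(loss_per_instance)  # approximately [0.479, 1.200]
print(total_loss)         # approximately 0.839
```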
tf.nn.softmax computes the forward propagation through a softmax layer. You use it during evaluation of the model when you compute the probabilities that the model outputs.
tf.nn.softmax_cross_entropy_with_logits computes the cost for a softmax layer. It is only used during training.
The logits are the unnormalized log probabilities output by the model (the values output before the softmax normalization is applied to them).
Mathematical motivation for term
When we wish to constrain an output between 0 and 1, but our model architecture outputs unconstrained values, we can add a normalisation layer to enforce this.
A common choice is a sigmoid function [1]. In binary classification this is typically the logistic function, and in multi-class tasks the multinomial logistic function (a.k.a. softmax) [2].
If we want to interpret the outputs of our new final layer as 'probabilities', then (by implication) the unconstrained inputs to our sigmoid must be inverse-sigmoid(probabilities). In the logistic case this is equivalent to the log-odds of our probability (i.e. the log of the odds), a.k.a. the logit: logit(p) = log(p / (1 - p)).
That is why the argument to softmax is called logits in Tensorflow: under the assumption that softmax is the final layer in the model, and the output p is interpreted as a probability, the input x to this layer is interpretable as a logit.
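The inverse relationship between the logistic function and the logit can be checked numerically. This is a small sketch in plain Python, with helper names chosen for the illustration.

```python
import math


def sigmoid(x):
    """Logistic function: unconstrained value -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds: probability in (0, 1) -> unconstrained value."""
    return math.log(p / (1.0 - p))


for x in (-3.0, 0.0, 2.0):
    p = sigmoid(x)
    # Applying logit to the probability recovers the original input
    # (up to floating-point rounding).
    print(x, p, logit(p))
```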
Generalised term
In Machine Learning there is a propensity to generalise terminology borrowed from maths/stats/computer science, hence in Tensorflow logit (by analogy) is used as a synonym for the input to many normalisation functions.
[1] While it has nice properties, such as being easily differentiable and having the aforementioned probabilistic interpretation, it is somewhat arbitrary.
[2] softmax might be more accurately called softargmax, as it is a smooth approximation of the argmax function.
The answers above describe the question well enough.
Adding to that, Tensorflow has optimised the combined operation of applying the activation function and then calculating the cost. Hence it is good practice to use the fused tf.nn.softmax_cross_entropy_with_logits() rather than calling tf.nn.softmax() and then computing the cross entropy yourself.
You can see a prominent difference between the two in a resource-intensive model.
Tensorflow 2.0 Compatible Answer: The explanations of dga and stackoverflowuser2010 are very detailed about logits and the related functions.
All those functions work fine in Tensorflow 1.x, but if you migrate your code from 1.x (1.14, 1.15, etc.) to 2.x (2.0, 2.1, etc.), calling them the same way can result in errors.
Hence, for the benefit of the community, here are the 2.0-compatible calls for all the functions discussed above.
Functions in 1.x:
tf.nn.softmax
tf.nn.softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits
Respective Functions when Migrated from 1.x to 2.x:
tf.compat.v2.nn.softmax
tf.compat.v2.nn.softmax_cross_entropy_with_logits
tf.compat.v2.nn.sparse_softmax_cross_entropy_with_logits
For more information about migration from 1.x to 2.x, please refer to this Migration Guide.
One more thing I would like to highlight: a logit is just a raw output, generally the output of the last layer, and it can be negative. If we use it as it is for the "cross entropy" evaluation, as shown below:
-tf.reduce_sum(y_true * tf.log(logits))
then it won't work, because the log of a negative number is not defined.
Applying a softmax activation first overcomes this problem, since its outputs are always positive.
This is my understanding; please correct me if I'm wrong.
Logits are the unnormalized outputs of a neural network. Softmax is a normalization function that squashes those outputs so that they are all between 0 and 1 and sum to 1. softmax_cross_entropy_with_logits is a loss function that takes in the raw, unnormalized outputs (the logits) together with the true labels, applies softmax internally, and returns a loss value.
