So far I have used Keras with TensorFlow to model image processing, NLP, and time series prediction tasks. Whenever the labels had multiple categories, the task was always to predict which single class a sample belongs to. For example, the list of possible classes was [car, human, airplane, flower, building], and the final prediction gave a probability for each class; for a very confident prediction, one class had a very high probability and the others very low ones.
Now I came across this Kaggle challenge, the Toxic Comment Classification Challenge, and specifically this implementation. I thought that this is a multi-label classification problem, as one sample can belong to several classes. And indeed, when I check the final prediction:
I can see that the first sample has a very high probability for both toxic and obscene. With my knowledge so far, a standard model would predict which one of these classes the sample belongs to, so either class 1 or class 2 or ... . In case of a confident prediction I would get a high probability for toxic and low probabilities for the others, or in case of an unconfident prediction something like 0.4x for toxic, 0.4x for obscene and small probabilities for the rest.
Now I was surprised by how the implementation was done, and I do not understand the following:
How is multi-label classification done (as opposed to the "usual" model)?
When checking the code I see the following model:
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, GlobalMaxPool1D, Dense, Dropout

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)  # one independent probability per label
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
I understand that x = Dense(6, activation="sigmoid") results from having to predict 6 classes; that would be the same with my knowledge so far. But why does this then yield a probability per label, i.e. a multi-label classification? Where is the implementation difference between multi-label classification and predicting just one label out of several choices?
Is the only difference that binary cross-entropy is used instead of (sparse) categorical cross-entropy together with the 6 outputs? So that we have a binary problem for each of the classes, each class is handled separately, and the model gives a probability for each class that the sample belongs to it, which is why it can assign high probabilities to several classes at once?
The loss function to be used is indeed the binary_crossentropy with a sigmoid activation.
The categorical_crossentropy is not suitable for multi-label problems, because in multi-label problems the labels are not mutually exclusive. To repeat the key point: the labels are not mutually exclusive.
This means that a label vector of the form [1,0,1,0,0,0] is perfectly valid. The categorical_crossentropy with softmax will always tend to favour one specific class, but that is not what we want here; as you saw, a comment can be both toxic and obscene.
Now imagine photos with cats and dogs in them. What happens if we have 2 dogs and 2 cats in a photo? Is it a dog picture or a cat picture? It is actually a "both" picture! We definitely need a way to express that multiple labels pertain to the same photo.
The rationale for using binary_crossentropy and sigmoid for multi-label classification lies in their mathematical properties: each output needs to be treated as an independent Bernoulli distribution.
Therefore, the only correct solution is BCE + 'sigmoid'.
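To make the difference concrete, here is a minimal sketch (the probability values are made up for illustration) of how the 6 sigmoid outputs are thresholded independently, instead of taking a single argmax:

import numpy as np

# Hypothetical per-label probabilities for two comments over the 6 labels
# (toxic, severe_toxic, obscene, threat, insult, identity_hate).
probs = np.array([[0.92, 0.10, 0.88, 0.02, 0.35, 0.05],
                  [0.07, 0.01, 0.04, 0.00, 0.03, 0.01]])

# Each column is an independent "is this label present?" decision,
# so every label gets its own threshold check instead of one argmax.
labels = (probs >= 0.5).astype(int)
print(labels)   # [[1 0 1 0 0 0]
                #  [0 0 0 0 0 0]]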
As you already found out, this is not a "classic" classification problem. For the classification problems you described in your text, softmax activation is commonly used to obtain high and low confidences that sum up to 1.
If you want to predict a binary problem, for example "credit card fraud", you can choose between a softmax activation with 2 output neurons (fraud vs. non-fraud) and a single output neuron with sigmoid activation. In the latter case, the single neuron outputs values in the range [0, 1] and a threshold is chosen, for example 0.5: all outputs < 0.5 belong to class 0 and all >= 0.5 to class 1.
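As a rough sketch of those two binary setups (the input width n_features and the hidden layer size are placeholders, not taken from the question):

from keras.models import Sequential
from keras.layers import Dense

n_features = 20  # placeholder input width

# Variant 1: two output neurons + softmax, one-hot labels, categorical cross-entropy.
two_neuron = Sequential([
    Dense(32, activation='relu', input_shape=(n_features,)),
    Dense(2, activation='softmax')
])
two_neuron.compile(optimizer='adam', loss='categorical_crossentropy')

# Variant 2: a single sigmoid neuron, 0/1 labels, binary cross-entropy;
# the output is thresholded, e.g. at 0.5.
one_neuron = Sequential([
    Dense(32, activation='relu', input_shape=(n_features,)),
    Dense(1, activation='sigmoid')
])
one_neuron.compile(optimizer='adam', loss='binary_crossentropy')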
If you want to build a model which is able to predict multiple classes for one input, you should use the second (sigmoid) approach, with one sigmoid output per class. This enables outputs like the one in your image.
To be honest, I am not sure if the "binary-crossentropy" is the correct loss for a problem like this.
Related
I have trained a binary classification task (pos. vs. neg.) and have a .h5 model. I also have external data which was never used in training nor in validation. There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds , axis=1)
The above code is supposed to give probabilities (preds) and class labels (0 or 1) if the model were trained with softmax as the last output layer. But preds is only a single number between [0, 1], and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC with the area being around 0.75.
I can see that the probabilities of those 20 samples mostly (17 of them) lie between 0 and 0.15; the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 of the 20 test samples belong to the positive class, the other 10 to the negative class. All 10 positive samples have very low probability (0 - 0.15). 7 out of 10 negative samples have the same low probability, only 3 have higher values (0.74, 0.51 and 0.79).
The question: Why is the model predicting the samples with such low probabilities even though its AUC was quite high?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with a length equal to the number of images to predict. We can retrieve the predicted class simply by checking the probability score: if it is above 0.5 (this is common practice, but you can change it according to your needs), the image belongs to class 1, else it belongs to class 0.
preds = model.predict(img)               # (n_images, 1)
y_classes = ((preds > 0.5) + 0).ravel()  # (n_images,)
In the case of sigmoid, your last output layer must be Dense(1, activation='sigmoid').
In the case of softmax (as you have done), the predicted classes are retrieved using argmax:
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds , axis=1) # (n_images,)
In the case of softmax, your last output layer must be Dense(n_classes, activation='softmax').
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading and can sometimes cause us to overestimate and sometimes to underestimate the actual performance of a model. The behaviour of average precision is more expressive for getting a feel of how the model is doing, because it is more sensitive in distinguishing between a good and a very good model. Moreover, it is directly linked to precision, an indicator which is human-understandable. Here is a great reference about the topic which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
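As a quick sketch of how the two metrics could be compared side by side (the labels and scores below are random placeholders, not the 20 samples from the question):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)   # placeholder ground-truth labels
y_prob = rng.random(200)                # placeholder predicted probabilities

print("ROC AUC           :", roc_auc_score(y_true, y_prob))
print("Average precision :", average_precision_score(y_true, y_prob))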
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
The softmax function takes the raw scores (logits) of all classes, exponentiates them and normalizes by their sum, so that each class probability is expressed relative to all the other classes and the probabilities sum to 1.
For example: suppose I'm using a model to predict whether an image is an image of a banana, apple or grape. Ignoring the exponentiation for simplicity, if the per-class scores for a certain image are 0.75 banana, 0.20 apple and 0.15 grape, the normalization step works like this:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818, apple: 0.20 / 1.1 = 0.1818, grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to the softmax layer.
So, to come to the point: the interpretation of a sigmoid output is similar to the one you would make for a softmax layer, but while a softmax layer gives you the comparison between one class and the others, a sigmoid output simply tells you how likely it is that this sample belongs to the positive class.
In order to make the final call and decide whether a certain item does or doesn't belong to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of your output interpretation. If you'd like to maximize the precision of your model, you will pick a high threshold; if you'd like to maximize its recall, you can definitely pick a lower threshold.
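A minimal sketch of how such a threshold could be chosen from the precision/recall trade-off (y_true and y_prob below are random placeholders standing in for your 20 labels and sigmoid outputs):

import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=20)   # placeholder ground-truth labels
y_prob = rng.random(20)                # placeholder sigmoid outputs

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# e.g. pick the threshold that maximizes F1 (any other criterion works too)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
y_pred = (y_prob >= best_threshold).astype(int)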
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.
I built a CNN audio classifier with 3 classes. My problem is that there are more than 3 classes in reality, e.g. a fourth could be "noise".
So when I call predict, the sum of these 3 class outputs is always 1.
prediction = model.predict([X])
Is it somehow possible to extract a score for each class such that the sum of these scores can be less than 1?
If you use a softmax activation function you are forcing the outputs to sum to 1, thereby creating a relative confidence score between your classes. Perhaps, without knowing more about your data and application, a "1 vs all" type scheme would work better for your purposes: each class gets a sigmoid activation, you pick the highest prediction, and if that prediction does not exceed a sensitivity threshold, none of the classes is predicted and the sample is implicitly treated as "noise" (see the sketch below).
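A rough sketch of that idea, reusing model and X from the question (the 0.6 threshold and the array shapes are assumptions; the model's last layer would need to be Dense(3, activation='sigmoid') trained with binary_crossentropy so the three scores no longer sum to 1):

import numpy as np

probs = model.predict([X])               # assumed shape (n_samples, 3), independent per-class scores
best = probs.argmax(axis=1)              # most likely of the 3 known classes
confident = probs.max(axis=1) >= 0.6     # sensitivity threshold, tune as needed

# -1 stands in for "none of the known classes", i.e. implicitly "noise"
pred = np.where(confident, best, -1)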
As part of a project for my studies I want to try to approximate a function f: R^m -> R^n using a Keras neural network (to which I am completely new). The network seems to learn up to some (admittedly unsatisfactory) point, but the predictions of the network don't resemble the expected results in the slightest.
I have two numpy-arrays containing the training-data (the m-dimensional input for the function) and the training-labels (the n-dimensional expected output of the function). I use them for training my Keras model (see below), which seems to be learning on the provided data.
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))
hidden = Dense(100, activation='sigmoid')(inputs)
hidden = Dense(80, activation='sigmoid')(hidden)
outputs = Dense(n, activation='softmax')(hidden)

opti = tf.keras.optimizers.Adam(lr=0.001)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=opti,
              loss='poisson',
              metrics=['accuracy'])
model.fit(training_data, training_labels, verbose=2, batch_size=32, epochs=30)
When I call the evaluate-method on my model with a set of test-data and a set of test-labels, I get an apparent accuracy of more than 50%. However, when I use the predict method, the predictions of the network do not resemble the expected results in the slightest. For example, the first ten entries of the expected output are:
[0., 0.08193582, 0.13141066, 0.13495408, 0.16852582, 0.2154705 ,
0.30517559, 0.32567417, 0.34073457, 0.37453226]
whereas the first ten entries of the predicted results are:
[3.09514281e-09, 2.20849714e-03, 3.84095078e-03, 4.99367528e-03,
6.06226595e-03, 7.18442770e-03, 8.96730460e-03, 1.03423093e-02, 1.16029680e-02, 1.31887039e-02]
Does this have something to do with the metrics I use? Could the results be normalized by Keras in some non-transparent way? Have I just used the wrong kind of model for the problem I want to solve? And what does 'accuracy' mean here anyway?
Thank you in advance for your help, I am new to neural networks and have been stuck with this issue for several days.
The problem is with this line:
outputs = Dense(n, activation='softmax')(hidden)
We use softmax activation only in classification problems, where we need a probability distribution over the classes as the output of the network. Softmax therefore ensures that the outputs sum to one and are non-zero (which is exactly what you observe). But the problem at hand is not a classification task: you are trying to predict n continuous target variables, so use a linear activation function instead. Modify the line above to something like this:
outputs = Dense(n, activation='linear')(hidden)
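The answer above only changes the activation; as a broader, hypothetical regression-style setup one might also swap the loss and the metric, roughly like this (m and n are the dimensions from the question; the mse loss and mae metric are my assumptions, not part of the answer):

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))
hidden = Dense(100, activation='sigmoid')(inputs)
hidden = Dense(80, activation='sigmoid')(hidden)
outputs = Dense(n, activation='linear')(hidden)   # linear output for continuous targets

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='mse',        # assumption: mean squared error instead of 'poisson'
              metrics=['mae'])   # assumption: mean absolute error instead of 'accuracy'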
I am building a classifying ANN with Python and the Keras library. I am training the NN on an imbalanced dataset with 3 different classes. Class 1 is about 7.5 times as prevalent as classes 2 and 3. As a remedy, I took the advice of this stackoverflow answer and set my class weights as such:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}
However, here is the problem: The ANN is predicting the 3 classes at equal rates!
This is not useful because the dataset is imbalanced, and predicting the outcomes as each having a 33% chance is inaccurate.
Here is the question: How do I deal with an imbalanced dataset so that the ANN does not predict Class 1 every time, but also so that the ANN does not predict the classes with equal probability?
Here is the code I am working with:
class_weight = {0: 1,
                1: 6.5,
                2: 7.5}

# Making the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

classifier = Sequential()

# Adding the input layer and the first hidden layer with dropout
classifier.add(Dense(activation='relu',
                     input_dim=5,
                     units=3,
                     kernel_initializer='uniform'))
# Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate=0.1))

# Adding the second hidden layer
classifier.add(Dense(activation='relu',
                     units=3,
                     kernel_initializer='uniform'))
# Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate=0.1))

# Adding the output layer
classifier.add(Dense(activation='sigmoid',
                     units=2,
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size=100, epochs=100, class_weight=class_weight)
The most evident problem that I see with your model is that it is not properly structured for classification.
If your samples can belong to only one class at a time, then you should not overlook this fact by having a sigmoid activation as your last layer.
Ideally, the last layer of a classifier should output the probability of a sample belonging to each class, i.e. (in your case) an array [a, b, c] where a + b + c == 1.
If you use a sigmoid output, then the output [1, 1, 1] is possible, although it is not what you are after. This is also the reason why your model is not generalizing properly: given that you're not specifically training it to prefer "unbalanced" outputs (like [1, 0, 0]), it will default to predicting the average values that it sees during training, accounting for the reweighting.
Try changing the activation of your last layer to 'softmax' and the loss to 'categorical_crossentropy':
# Adding the output layer
classifier.add(Dense(activation='softmax',
                     units=3,            # one unit per class (3 classes here)
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
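Note that categorical_crossentropy expects one-hot encoded targets. If y_train currently holds integer class ids (an assumption about your data), a sketch like this would also be needed before fitting:

from keras.utils import to_categorical

y_train_onehot = to_categorical(y_train, num_classes=3)
classifier.fit(X_train, y_train_onehot, batch_size=100, epochs=100,
               class_weight=class_weight)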
If this doesn't work, see my other comment and get back to me with that info, but I'm pretty confident that this is the main problem.
Cheers
Imbalanced datasets (where classes are uneven or unequally distributed) are a prevalent problem in classification. For example, one class label has a very high number of observations, and the other has a pretty low number of observations. Significant causes of data imbalance include:
Faulty data collection
Domain peculiarity – some domains are inherently imbalanced, for example fraud detection or rare-disease diagnosis.
Imbalanced datasets can create many problems in classification hence the need to improve datasets for robust models and improve performance.
Here are several methods to bring balance to imbalanced datasets:
Undersampling – works by resampling (reducing) the majority class points in a dataset to match the number of minority class points. It brings equilibrium between the majority and minority classes so that the classifier gives equal importance to both. However, undersampling may cause some loss of information and hence weaker results.
Oversampling – Also known as upsampling, oversampling resamples the minority class to equal the total number of majority class points. It replicates the observations from minority class points to balance datasets.
Synthetic Minority Oversampling Technique (SMOTE) – As the name suggests, SMOTE uses oversampling to create artificial data points for the minority class. It synthesizes new instances between the attributes of existing minority-class points (see the sketch after this list).
Searching for an optimal threshold over a grid – This technique involves predicting probabilities for a particular class label and then finding the optimal threshold that maps those probabilities to the correct class label.
Using the BalancedBaggingClassifier – The BalancedBaggingClassifier (from the imbalanced-learn package) resamples each bootstrap sample of the dataset before training each estimator of the ensemble, so every estimator is trained on balanced data.
Use different algorithms – Some algorithms aren’t effective in restoring balance in imbalanced datasets. Sometimes it’s wise to try different algorithms to stand a better chance at creating a balanced dataset and improving performance. For instance, you can employ regularization or penalized models to punish the wrong predictions on the minority class.
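As a minimal sketch of the SMOTE approach using the imbalanced-learn package (the synthetic dataset below is just a placeholder for your own data):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# SMOTE synthesizes new minority-class points until the classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))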
The effects of imbalanced datasets can be significant. Hopefully, one of the approaches above can help you get in the right direction.
To test which approach works best for you, I'd suggest using deepchecks, an open-source Python package for quickly validating data and models.
I am doing text classification with a convolutional neural network. I used health documents (ICD-9-CM codes) for my project, and I used the same model as dennybritz, but my data has 36 labels. I used one-hot encoding to encode my labels.
Here is my problem: when I run the code on data where each document has exactly one label, the accuracy is good, from 0.8 to 1. If I run it on data where documents can have more than one label, the accuracy is significantly reduced.
For example: a document with the single label "782.0" is encoded as [0 0 1 0 ... 0],
and a document with the multiple labels "782.0 V13.09 593.5" as [1 0 1 0 ... 1].
Could anyone suggest why this happens and how to improve it?
The label encoding seems correct. If you have multiple correct labels, [1 0 1 0 ... 1] looks totally fine. The loss function used in Denny's post is tf.nn.softmax_cross_entropy_with_logits, which is the loss function for a multi-class problem.
Computes softmax cross entropy between logits and labels. Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
In multi-label problem, you should use tf.nn.sigmoid_cross_entropy_with_logits:
Computes sigmoid cross entropy given logits.
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
The input to the loss function would be logits (WX) and targets (labels).
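A sketch of what the swapped loss could look like in that TF1-style code (the variable names self.scores and self.input_y are assumptions about the implementation; the labels tensor must be float):

# Multi-label loss: one independent sigmoid cross-entropy term per label
with tf.name_scope("loss"):
    losses = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=self.input_y, logits=self.scores)
    self.loss = tf.reduce_mean(losses)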
Fix the accuracy measure
In order to measure the accuracy correctly for a multi-label problem, the code below needs to be changed.
# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
The logic of correct_predictions above is incorrect when you can have multiple correct labels. For example, say num_classes=4 and labels 0 and 2 are correct, so your input_y=[1, 0, 1, 0]. The correct_predictions logic would then need to break the tie between index 0 and index 2. I am not sure how tf.argmax breaks ties, but if it does so by choosing the smaller index, a prediction of label 2 is always considered wrong, which definitely hurts your accuracy measure.
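One possible replacement is to threshold each label independently and average the per-label correctness (a sketch only; the 0.5 threshold and the variable names self.scores and self.input_y are assumptions):

# Per-label accuracy for the multi-label case
with tf.name_scope("accuracy"):
    predicted_labels = tf.cast(tf.sigmoid(self.scores) > 0.5, tf.float32)
    correct_per_label = tf.cast(tf.equal(predicted_labels, self.input_y), tf.float32)
    self.accuracy = tf.reduce_mean(correct_per_label, name="accuracy")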
Actually, in a multi-label problem, precision and recall are better metrics than accuracy. You can also consider using precision@k (tf.nn.in_top_k) to report classifier performance.