I am doing text classification with a Convolutional Neural Network. I used health documents (ICD-9-CM codes) for my project and the same model as dennybritz used, but my data has 36 labels. I used one-hot encoding to encode my labels.
Here is my problem: when I run data in which each document has a single label, the accuracy is good, ranging from 0.8 to 1. If I run data in which documents have more than one label, the accuracy drops significantly.
For example, a document with the single label "782.0" is encoded as [0 0 1 0 ... 0],
while a document with the multiple labels "782.0 V13.09 593.5" is encoded as [1 0 1 0 ... 1].
Could anyone suggest why this happens and how to improve it?
The label encoding seems correct. If you have multiple correct labels, [1 0 1 0 ... 1] looks totally fine. However, the loss function used in Denny's post is tf.nn.softmax_cross_entropy_with_logits, which is the loss function for a multi-class (single-label) problem:
Computes softmax cross entropy between logits and labels.
Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one class).
In a multi-label problem you should use tf.nn.sigmoid_cross_entropy_with_logits instead:
Computes sigmoid cross entropy given logits.
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
The input to the loss function would be logits (WX) and targets (labels).
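As a minimal sketch (written against the TF 1.x graph API that Denny's code uses, with placeholder tensors standing in for the model's actual logits and label input), the loss for the 36-label setup could look like this:
import tensorflow as tf

num_classes = 36
input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")  # multi-hot labels, e.g. [1 0 1 0 ... 1]
scores = tf.placeholder(tf.float32, [None, num_classes], name="scores")    # logits (WX + b) from the last layer

with tf.name_scope("loss"):
    # one loss term per (example, class); sum over classes, average over the batch
    losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=scores, labels=input_y)
    loss = tf.reduce_mean(tf.reduce_sum(losses, axis=1))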
Fix the accuracy measure
In order to measure the accuracy correctly for a multi-label problem, the code below needs to be changed.
# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
The logic of correct_predictions above is incorrect when you can have multiple correct labels. For example, say num_classes=4 and labels 0 and 2 are correct, so input_y=[1, 0, 1, 0]. tf.argmax(self.input_y, 1) then has to break the tie between index 0 and index 2. I am not sure how tf.argmax breaks ties, but if it does so by choosing the smaller index, then a prediction of label 2 is always considered wrong, which definitely hurts your accuracy measure.
Actually, in a multi-label problem, precision and recall are better metrics than accuracy. You can also consider precision@k (tf.nn.in_top_k) to report classifier performance.
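For example, here is one way (among several) to compute micro-averaged precision and recall after thresholding the sigmoid outputs; the arrays below are made up just to show the bookkeeping:
import numpy as np

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0]])                  # multi-hot ground truth
y_prob = np.array([[0.9, 0.2, 0.6, 0.1],
                   [0.3, 0.4, 0.7, 0.1]])          # sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)               # predict a label when its score >= 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))
precision = tp / max(np.sum(y_pred == 1), 1)       # of the predicted labels, how many are correct
recall = tp / max(np.sum(y_true == 1), 1)          # of the true labels, how many were found
print(precision, recall)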
Related
I have trained a binary classification task (pos. vs. neg.) and have a .h5 model. I also have external data (which was never used in training nor in validation). There are 20 samples overall, belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds, axis=1)
The above code is supposed to produce probabilities (preds) and class labels (0 or 1) if the model were trained with softmax as the last output layer. But preds is only a single number between 0 and 1, and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC, the area being around 0.75.
I can see that the probabilities of those 20 samples mostly (17 of them) lie between 0 and 0.15; the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 of the 20 test samples belong to the positive class, the other 10 belong to the negative class. All 10 samples of the positive class have very low probability (0 - 0.15). 7 out of the 10 negative samples have the same low probability, with only 3 being 0.74, 0.51 and 0.79.
The question: why is the model predicting the samples with such low probability even though its AUC was quite high?
The sigmoid activation function is used to generate probabilities in binary classification problems. In this case, the model outputs an array of probabilities with shape equal to the number of images to predict. We can retrieve the predicted class simply by checking the probability score: if it's above 0.5 (this is common practice, but you can also change it according to your needs), the image belongs to class 1; otherwise it belongs to class 0.
preds = model.predict(img) # (n_images, 1)
y_classes = (preds > 0.5).astype(int).ravel() # (n_images,)
In the case of sigmoid, your last output layer must be Dense(1, activation='sigmoid').
In the case of softmax (as you have done), the predicted classes are retrieved using argmax:
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds, axis=1) # (n_images,)
In the case of softmax, your last output layer must be Dense(n_classes, activation='softmax').
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading: it can sometimes cause us to overestimate and sometimes to underestimate the actual performance of a model. The behavior of average precision is more expressive in getting a feel for how the model is doing, because it is more sensitive in distinguishing between a good and a very good model. Moreover, it is directly linked to precision, an indicator which is human-understandable. Here is a great reference about the topic which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
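If scikit-learn is available, you can put the two metrics side by side on your held-out samples; the labels and scores below are invented just to show the calls:
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]                                  # hypothetical ground truth
y_score = [0.05, 0.10, 0.12, 0.14, 0.15, 0.51, 0.74, 0.06, 0.08, 0.79]   # hypothetical model outputs

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("Average precision:", average_precision_score(y_true, y_score))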
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
The softmax function just takes a sequence of sigmoid outputs, aggregates them, and shows the ratio between a specific class probability and the aggregated probabilities of all classes.
For example: if I'm using a model to predict whether an image shows a banana, an apple or a grape, and my model recognizes that a certain image is 0.75 banana, 0.20 apple and 0.15 grape (each probability generated with a sigmoid function), my softmax layer will make this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818; apple: 0.20 / 1.1 = 0.1818; grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to our softmax layer. Yet, in order to make this classification, it first used a series of sigmoid functions.
So, to finally get to the point: I'd say that the interpretation of a sigmoid output should be similar to the one you'd make with a softmax layer, but while a softmax layer gives you a comparison between one class and another, a sigmoid function simply tells you how likely it is that this piece of information belongs to the positive class.
In order to make the final call and decide whether a certain item belongs to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of interpreting your output. If you'd like to maximize the precision of your model, pick a high threshold; if you'd like to maximize its recall, you can definitely pick a lower one.
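A sketch of that last step, assuming scikit-learn and a small made-up validation set; the 0.8 precision target is arbitrary:
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                           # hypothetical validation labels
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.20, 0.55]   # hypothetical sigmoid outputs

precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
# pick the lowest threshold that still reaches the precision you want
candidates = [t for p, t in zip(precisions[:-1], thresholds) if p >= 0.8]
threshold = min(candidates) if candidates else 0.5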
I hope this answers your question; let me know if you'd like me to elaborate on anything, as this answer is quite general.
I made a neural network with Keras in Python and cannot really understand what the loss function means.
So here is some general information first:
I worked with the poker hand dataset with classes 0-9, which I encoded as vectors using one-hot encoding. I used the softmax activation in the last layer, so for each of the 10 entries in the output vector my model tells me the probability that the sample belongs to that class. For example:
my real label is (0,1,0,0,0,0,0,0,0,0), which means class 1 (classes 0-9 range from no card to royal flush), and class 1 means one pair (if you know poker).
With the neural net, I get outputs like (0.4, 0.2, 0.1, 0.1, 0.2, 0, 0, 0, 0, 0), which means that my sample belongs to class 0 with 40 percent probability, to class 1 with 20 percent, and so on.
Alright! I also used binary cross-entropy as the loss, the accuracy metric and the RMSprop optimizer.
When I use model.evaluate() from Keras, I get something like 0.16 for the loss and I do not know how to interpret this.
Does this mean that, on average, my predictions deviate 0.16 from the true values? So if my prediction for class 0 is 0.5, could the true value also be 0.66 or 0.34?
Or how else should I interpret it?
Please send help!
First of all, according to your problem definition you have a multi-class problem. Thus, you should use categorical_crossentropy; binary cross-entropy is for two-class problems or for multi-label classification.
But in general, the value of the loss function only has relative meaning. First of all, you have to understand what cross-entropy means. For a single observation o it is defined as
-∑_{c=1}^{M} y_{o,c} · log(p_{o,c})
where M is the number of classes, c runs over the possible classes, y_{o,c} is the binary indicator (0 or 1) of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that o is of class c.
For binary cross-entropy, M is equal to 2. For categorical cross-entropy, M > 2.
Therefore, the cross-entropy decreases as the predicted probability for the correct label converges to 1.
Now let's take your example, where you have 10 classes and your real label is (0,1,0,0,0,0,0,0,0,0).
If you have a loss of 0.16, it means that
-log(p_{o,1}) = 0.16, i.e. p_{o,1} = e^(-0.16) ≈ 0.85,
which means that your model has assigned about 0.85 to the correct label.
Therefore, the loss function gives you the negative log of the probability the model assigned to the correct class. Since in Keras the loss is computed over whole batches, it is the average of the negative log probability of the correct class over all data in the specific batch. If you use the evaluate function, it is this average over all the data you are evaluating.
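As a quick numeric check of this (with a made-up prediction vector that happens to put 0.85 on the true class):
import numpy as np

y_true = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])   # the one-hot label from the question
y_pred = np.array([0.05, 0.85, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005])

loss = -np.sum(y_true * np.log(y_pred))   # categorical cross-entropy for this one sample
print(loss)                               # about 0.16, since exp(-0.16) is roughly 0.85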
I have data with 4000 CNN features and it is a binary classification problem. All I know about the test data is the proportions of 1 and 0. How can I tell my model to predict the test labels using these proportions? (For example, is there a way to say: in order to reach these proportions, I will give this instance a 0.)
How can I use this to increase accuracy? In my case the training data consists mostly of 1s (85%) and of 0s (15%).
However, in my test data the proportion of 1 is given as 38%, so it is much different from the training data.
I worked a little bit with balancing the data and it helped. However, my model still predicts 1 for nearly all of the data. This may also be due to the mismatch between the training and test distributions (an adaptation problem).
As @birdwatch suggested, I decreased the threshold for the 0 value to increase the number of 0 labels in the predictions:
# Predicting the test set results
y_pred = classifier.predict_proba(X_test)

# label a sample as 0 whenever P(class 0) exceeds the lowered threshold
threshold = 0.3
y_pred[:, 0] = (y_pred[:, 0] > threshold).astype('int')
Before changing the threshold, the predicted label counts were as follows:
1 : 8906
0 : 2968
After changing the threshold they are:
1 : 3221
0 : 8653
However, is there any other way I can use the test proportions that ensures the desired result?
There isn't any sensible way to do that; doing so would create a weird bias in the model. One thing you could do is accept the less likely outcome only if it has a high enough score. Normally you'd use a 0.5 threshold, but here you might take e.g. 0.7.
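A sketch of that suggestion, with made-up predict_proba output and class 0 treated as the less likely outcome:
import numpy as np

y_prob = np.array([[0.35, 0.65],
                   [0.75, 0.25],
                   [0.55, 0.45]])        # column 0 = P(class 0), column 1 = P(class 1)

threshold = 0.7                          # stricter than the usual 0.5
# default to the majority class and accept class 0 only when it clears the threshold
y_pred = np.where(y_prob[:, 0] >= threshold, 0, 1)
print(y_pred)                            # [1 0 1]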
I built a CNN audio classifier with 3 classes. My problem is that there are really more than 3 classes, e.g. a fourth could be "noise".
So when I call predict, the sum over these 3 classes is always 1.
prediction = model.predict([X])
Is it somehow possible to extract a confidence score for each class so that the sum of these scores can be less than 1?
If you use a softmax activation function you are forcing the outputs to sum to 1, thereby making a relative confidence score between your classes. Perhaps, without knowing more about your data and application, a "1 vs. all" scheme would work better for your purposes: give each class a sigmoid activation and pick the highest prediction, but if that prediction doesn't score high enough on a sensitivity threshold then none of the classes is predicted, which is implicitly "noise".
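A minimal sketch of that idea in Keras, with a placeholder feature size and layer widths standing in for the actual audio CNN (only the output activation, loss and the rejection step matter here):
import numpy as np
import tensorflow as tf

n_classes = 3
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(128,)),  # 128 is a placeholder feature size
    tf.keras.layers.Dense(n_classes, activation='sigmoid'),            # independent score per class
])
model.compile(optimizer='adam', loss='binary_crossentropy')

probs = model.predict(np.random.rand(5, 128))   # shape (5, 3); rows no longer sum to 1
best = probs.argmax(axis=1)
# if even the best class stays below the threshold, call the clip "noise" (-1 here)
prediction = np.where(probs.max(axis=1) >= 0.5, best, -1)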
I have a dataset of the following form: a series of M observations of N-dimensional data. In order to obtain latent factors from this data, I want to train a single-hidden-layer autoencoder on it. Every dimension of a single observation is either a 0 or a 1, but the Keras Model returns floats. Is there a way to add a layer to enforce a 0 or 1 as output?
I tried using a simple Keras Model to solve this problem. It claims good accuracy on the data, but when I look at the raw output it predicts the 0's correctly and often completely ignores the 1's.
import tensorflow as tf

n_nodes = 50

# single hidden layer, sigmoid reconstruction of the binary input
input_1 = tf.keras.layers.Input(shape=(x_train.shape[1],))
x = tf.keras.layers.Dense(n_nodes, activation='relu')(input_1)
output_1 = tf.keras.layers.Dense(x_train.shape[1], activation='sigmoid')(x)
model = tf.keras.models.Model(input_1, output_1)

my_optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.002)
model.compile(optimizer=my_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10000)

predictions = model.predict(x_test)
I then validate these predictions by looking at all experiments and checking whether a large (>0.1) value is returned for the elements that are 1. The performance on the 1's is very poor.
I have seen that the loss converges after around 10000 epochs. However, the autoencoder fails to properly predict almost all 1's in the dataset. Even when I set the width of the hidden layer equal to the dimensionality of the data (n_nodes = x_train.shape[1]) the autoencoder still performs badly, and it gets even worse if I increase the width of the hidden layer.
Outputs in [0, 1] should generally be rounded for the final prediction, so that >= 0.5 rounds to 1 and < 0.5 rounds to 0. However, your labels should be float values {0.0, 1.0} for the loss function (which I expect they already are). You can compute accuracy by rounding the outputs and comparing them to your binary labels to count errors, but the outputs must stay in continuous form [0.0, 1.0] for the loss and gradient calculations to work.
If you are doing all of that (and it does appear that things are set up correctly in your code), there might be a number of reasons for poor performance:
Your dense "constriction" layer should be significantly smaller than your input. By making it smaller you force the autoencoder to learn a representative form of the input that can be used to reproduce the output, and this representation is likely to generalize well. If you increase the size of your hidden layer, the network has much more capacity to simply memorize the inputs.
You might have many more 0 values than 1 values. If that is the case, then in the absence of actual learning the network can get stuck just predicting 0 as a "best guess", because that is "usually right". This is a harder problem to tackle. You might consider multiplying the loss by a vector of labels * eta + 1, which effectively increases the learning rate for the 1 labels. Example: your labels are [0, 1, 0], eta is a hyper-parameter value > 1, say eta = 2.0; then labels * eta + 1 = [1.0, 3.0, 1.0], which scales up the gradient signal for the 1 values by increasing the loss only where the label is 1. This isn't a bulletproof method of increasing the importance of the 1's class, but it's something simple to try (see the sketch after this list). If it makes any improvement, then follow up on this line of reasoning in more detail.
You have only 1 hidden layer, which limits how much structure the network can extract; you might try 3 hidden layers to add a little more non-linearity. Your center layer should be fairly small, try something like 5 or 10 neurons; it should have to squeeze the data through a fairly tight constriction point in order to extract a general-purpose representation.
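Here is one simple way to sketch the labels * eta + 1 idea from the second point as a custom Keras loss (shown on top of binary cross-entropy, which is the more natural choice for 0/1 reconstruction than the categorical loss in the question's code; treat it as a starting point, not the definitive fix):
import tensorflow as tf

eta = 2.0  # hyper-parameter > 1; larger values up-weight the 1 entries more

def weighted_bce(y_true, y_pred):
    # element-wise binary cross-entropy, scaled by (y_true * eta + 1) so the 1's cost more
    bce = tf.keras.backend.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(bce * (y_true * eta + 1.0))

# model.compile(optimizer=my_optimizer, loss=weighted_bce, metrics=['accuracy'])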