I am currently trying to vary the threshold of a Random Forest Classifier in order to plot a ROC Curve. I was under the impression that the only way to do this for a Random Forest is through the use of the class_weight parameter. I have been able to do this successfully, increasing and decreasing precision, recall, true positive and false positive rates; however, I am not sure what I am actually doing. Currently I have the following;
rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, n_estimators=50,max_depth=40,min_samples_split=100,min_samples_leaf=80, class_weight={0:.4, 1:.9})
What is the .4 and .9 actually referring too. I thought it was 40% of data set is 0's and 90% 1's however, this obviously makes no sense (over %100). What is it actually doing? THANKS!
Class weights typically do not need to normalise to 1 (it's only the ratio of the class weights that is important, so demanding that they sum to 1 would not actually be a restriction though).
So setting the class weights to 0.4 and 0.9 is equivalent to assuming a split of class labels in the data of 0.4 / (0.4+0.9) to 0.9 / (0.4+0.9) [roughly ~30% belonging to class 0 and ~70% belonging to class 1].
An alternative way to view differing class weights is as a way of more strongly penalising mistakes in one class compared to another, but still assuming balanced numbers of labelings in the data. In your example, it would be 9/4 times worse to misclassify a 1 as a 0 than it would be to misclassify a 0 as a 1.
The easiest (in my experience) way to vary the discrimination threshold of any of the scikit-learn classifiers is to use the predict_proba() function. Rather than returning a single output class, this returns the probabilities for membership in each class (concretely what it is doing is outputting the proportion of samples in the leaf nodes reached during the classification, averaged over all trees in the random forest.) Once you have these probabilities, it is easy to implement your own final classification step by comparing the probability for each class to some threshold which you can change.
probs = RF.predict_proba(X) # output dimension: [num_samples x num_classes]
for threshold in range(0,100):
threshold = threshold / 100.0
classes = (probs > threshold).astype(int)
# further analysis here as desired
Related
I'm new to ML and I've been working with an imbalanced data set where the count of negative samples is twice that of the positive samples. In-order to address these i set scikit-learn Random forest class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and the recall for class- 1 was 0.86, now when i tried to further improve the AUC Score by assigning weight, there wasn't any major difference with the results, i.e Class_weight = {0: 0.5, 1: 2.75}, assuming this would penalize for every wrong classification of 1 but it didn't seem to work as expected.
randomForestClf = RandomForestClassifier(random_state = 42, class_weight = {0: 0.5, 1:2.75})
Tried different values but has no major impact as Recall of 1 remains the same or reduces (0.85) and auc value is quite insignificant (0.90122). It only seems to work when one of the label is set 0.
Further tried to set the sample weights too. But that didn't seem to work either.
# Sample weights
class_weights = [0.5, 2]
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
weights[i] = class_weights[val]
Below is the reference to a similar question but the solutions provided didn't work for me.
sklearn RandomForestClassifier's class_weights seems to have no effect
Is there anything that i'm missing out?
Thanks!
The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
I have trained a binary classification task (pos. vs. neg.) and have a .h5 model. And I have external data (which was never used in training nor in the validation). There are 20 of samples overall belonging to both classes.
preds = model.predict(img)
y_classes = np.argmax(preds , axis=1)
The above code is supposed to calculate probability (preds) and class labels (0 or 1) if it were trained with softmax as the last output layer. But, preds is only a single number between [0;1] and y_classes is always 0.
To go back a little, the model was evaluated with mean AUC with the area being around 0.75.
I can see the probabilities of those 20 samples mostly (17) lie between 0 - 0.15, the rest are 0.74, 0.51 and 0.79.
How do I make a conclusion from this?
EDIT:
10 among 20 samples for testing the model belong to positive class, the other 10 belong to negative class. All 10 which belong to pos. class have very low prabability (0 - 0.15). 7 out 10 negative classes have the same low probability, only 3 being (0.74, 0.51 and 0.79).
The question: Why is the model predicting the samples with such a low probability even though its AUC was quite higher?
the sigmoid activation function is used to generate probabilities in binary classification problems. in this case, the model output an array of probabilities with shape equal to the length of images to predict. we can retrieve the predicted class simply checking the probability score... if it's above 0.5 (this is a common practice but u can also change it according to your needs) the image belongs to the class 1 else it belongs to the class 0.
preds = model.predict(img) # (n_images, 1)
y_classes = ((pred > 0.5)+0).ravel() # (n_images,)
in case of sigmoid, your last output layer must be Dense(1, activation='sigmoid')
in the case of softmax (as you have just done), the predicted class are retrieved using argmax
preds = model.predict(img) # (n_images, n_class)
y_classes = np.argmax(preds , axis=1) # (n_images,)
in case of softmax, your last output layer must be Dense(n_classes, activation='softmax')
WHY AUC IS NOT A GOOD METRIC
The value of AUC can be misleading and can cause us sometimes to overestimate and sometimes to underestimate the actual performance of a model. The behavior of Average-Precision is more expressive in getting a flavor of how the model is doing because it is more sensible in distinguishing between a good and a very good model. Moreover, it is directly linked to precision: an indicator which is human-understandable Here a great reference about the topics which explains all you need: https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728
By using a sigmoid function as your activation function you are basically "compressing" the output of prior layers to a probability value from 0 to 1.
Softmax function is just taking a sequence of sigmoid functions, aggregates them and shows the ratio between a specific class probability and all aggregated probabilities for all classes.
For example: if I'm using a model to predict whether an image is an image of a banana, apple or grape, and my model recognizes that a certain image is 0.75 banana, 0.20 apple and 0.15 grape (Each probability is generated with a sigmoid function), my softmax layer will make this calculation:
banana: 0.75 / (0.75 + 0.20 + 0.15) = 0.6818 && apple: 0.20 / 1.1 = 0.1818 && grape: 0.15 / 1.1 = 0.1364.
As we can see, this model will classify this specific picture as a picture of a banana thanks to our softmax layer. Yet, in order to make this classification, it priorly used a series of sigmoid functions.
So if we finally reach to the point, I'd say that the interpretation of a sigmoid function output should be similar to the one that you'd make with a softmax layer, but while a softmax layer gives you the comparison between one class to another, a sigmoid function simply tells you how likely it is that this piece of information belongs to the positive class.
In order to make the final call and decide if a certain item does or doesn't belong to the positive class, you need to pick a threshold (not necessarily 0.5). Picking a threshold is the final step of your output interpretation. If you'd like to max the precision of your model, you will pick a high threshold, but if you'd like to max the recall of your model you can definitely pick a lower threshold.
I hope it answers your question, let me know if you'd like me to elaborate on anything as this answer is quite general.
I'm new to ML and I've been working with an imbalanced data set where the count of negative samples is twice that of the positive samples. In-order to address these i set scikit-learn Random forest class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and the recall for class- 1 was 0.86, now when i tried to further improve the AUC Score by assigning weight, there wasn't any major difference with the results, i.e Class_weight = {0: 0.5, 1: 2.75}, assuming this would penalize for every wrong classification of 1 but it didn't seem to work as expected.
randomForestClf = RandomForestClassifier(random_state = 42, class_weight = {0: 0.5, 1:2.75})
Tried different values but has no major impact as Recall of 1 remains the same or reduces (0.85) and auc value is quite insignificant (0.90122). It only seems to work when one of the label is set 0.
Further tried to set the sample weights too. But that didn't seem to work either.
# Sample weights
class_weights = [0.5, 2]
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
weights[i] = class_weights[val]
Below is the reference to a similar question but the solutions provided didn't work for me.
sklearn RandomForestClassifier's class_weights seems to have no effect
Is there anything that i'm missing out?
Thanks!
The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
I have a data with 4000 CNN features and it is a binary classification problem. All I know about the test data is the proportions of 1 and 0. How can I tell to my model to predict test labels by using the proportions data ? (Like is there a way to say in order to reach this proportions I will give this instance 0.)
How can I use it to increase accuracy ? In my case the training data is mostly consist of 1 (85%) and 0(15%)
However in my test data proportion of l is given as (%38) So it is much different than training data.
I worked a little bit with balancing the data and it helped. However my model still predicts 1 for nearly all of the data. It may occur because of the adaptation problem also.
As #birdwatch suggested I decrease the threshold for the 0 value and try to increase the 0 label count on the prediction.
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
threshold=0.3
y_pred [:,0] = (y_pred [:,0] < threshold).astype('int')
Before the number of classes were as in follows:
1 : 8906
0 : 2968
After changing threshold now it is
1 : 3221
0 : 8653
However is there any other way that I can use test_proportions which ensures the result?
There isn't any sensible way to that. Doing so would create a weird bias in the model. One thing you could do is accept the less likely outcome only is it has high enough score. Normally you'd use 0.5 threshold, but here you might take e.g. 0.7.
The training dataset contains two classes A and B which we represent as 1 and 0 in our target labels correspondingly. Out labels data is heavily skewed towards class 0 which takes roughly 95% of the data while our class 1 is only 5%. How should we construct our loss function in such case?
I found Tensorflow has a function that can be used with weights:
tf.losses.sigmoid_cross_entropy
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value.
Sounds good. I set weights to 2.0 to make loss higher and punish errors more.
loss = loss_fn(targets, cell_outputs, weights=2.0, label_smoothing=0)
However, not only the loss didn't go down it increased and the final accuracy on the dataset decreased slightly. Ok, maybe I misunderstood and it should be < 1.0, I tried a smaller number. This didn't change anything, I got almost the same loss and accuracy. O_o
Needless to say that same network trained on the same dataset but with loss weight 0.3 significantly reduces the loss up to x10 times in Torch / PyTorch.
Can somebody please explain how to use loss weights in Tensorflow?
If you're scaling the loss with a scalar, like 2.0, then basically you're multiplying the loss and therefore the gradient for backpropagation. It's similar to increasing the learning rate, but not exactly the same, because you're also changing the ratio to regularization losses such as weight decay.
If your classes are heavily skewed, and you want to balance it at the calculation of loss, then you have to specify a tensor as weight, as described in the manual for tf.losses.sigmoid_cross_entropy():
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding losses dimension).
That is make the weights tensor 1.0 for class 0, and maybe 10 for class 1, and now "false negative" losses will be much more heavily counted.
It is an art how much you should over-weigh the underrepresented class. If you overdo it, the model will collapse and will predict the over-weighted class all the time.
An alternative to achieve the same thing is using tf.nn.weighted_cross_entropy_with_logits(), which has a pos_weight argument for the exact same purpose. But it's in tf.nn not tf.losses so you have to manually add it to the losses collection.
Generally another method to handle this is to arbitrarily increase the proportion of the underrepresented class at sampling. That should not be overdone either, however. You can do both of these things too.
You can set a penalty for misclassification of each sample. If weights is a tensor of shape [batch_size], the loss for each sample will be multiplied by the corresponding weight. So if you assign the same weight to all samples (which is the same as using a scalar weight), your loss will only be scaled by this scalar, and the accuracy should not change.
If you instead assign different weights for the minority class and the majority class, the contributions of the samples to the loss function will be different, and you should be able to influence the accuracy by choosing your weights differently.
A few scenarios (your choice will depend on what you need):
1.) If you want a good overall accuracy, it you could choose the weights of the majority class to be very large and the weights of the minority class much smaller. This will probably lead to a classification of all events into the majority class (i.e. 95 % of total classification accuracy, but the minority class will usually be classified into the wrong class.
2.) If your signal is the minority class and the background is the majority class, you probably want very little background contamination in your predicted signal, i.e. you want almost no background samples to be predicted as signal. This will also happen if you choose the majority weight much larger than the minority weight, but you might find that the network tends to predict all samples to be background. So you will not have any signal samples left.
In this case you should consider a large weight for the minority class + an extra loss for background samples being classified as signal samples (false positives), like this:
loss = weighted_cross_entropy + extra_penalty_for_false_positives