High AUC but bad predictions with imbalanced data - python

I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0    0.970691
1    0.029309
The params I used and the code for training are shown below.
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true',  # because the training data is unbalanced (later replaced with scale_pos_weight)
    'num_leaves': 31,        # should be smaller than 2^(max_depth)
    'max_depth': 6,          # -1 means no limit
    'subsample': 0.78
}
# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and the best number of rounds. I got 0.994 AUC on CV and a similar score on the validation set.
But when I predict on the test set I get very bad results. I am sure the training set is sampled correctly.
Which parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?

The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
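To make this concrete, here is a minimal sketch of picking a threshold from the ROC curve on a held-out validation set, using Youden's J statistic (maximising TPR - FPR); y_valid and valid_probs are placeholder names for your own validation labels and predicted probabilities, and the last line mirrors your own thresholding code:
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_valid, valid_probs)   # valid_probs = model.predict(valid_feats)
best_threshold = thresholds[np.argmax(tpr - fpr)]        # threshold maximising TPR - FPR
print(best_threshold)

preds = [1 if x >= best_threshold else 0 for x in preds]  # instead of the hard-coded 0.5
Youden's J is only one possible criterion; depending on the relative cost of false positives and false negatives, you may well want a different operating point on the curve.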
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Related

Class_weight and sample_weight ineffective for sklearn Random Forest

I'm new to ML and I've been working with an imbalanced data set where the count of negative samples is twice that of the positive samples. In order to address this, I set scikit-learn Random Forest class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and a recall of 0.86 for class 1. When I then tried to further improve the AUC score by assigning explicit weights, i.e. class_weight = {0: 0.5, 1: 2.75} (assuming this would penalise every misclassification of a 1 more heavily), there was no major difference in the results.
randomForestClf = RandomForestClassifier(random_state=42, class_weight={0: 0.5, 1: 2.75})
I tried different values, but there was no major impact: the recall for class 1 stays the same or drops slightly (0.85) and the change in the AUC value is insignificant (0.90122). It only seems to have an effect when one of the labels' weight is set to 0.
I also tried setting sample weights, but that didn't seem to work either.
# Sample weights
class_weights = [0.5, 2]
weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
    weights[i] = class_weights[val]
Below is a reference to a similar question, but the solutions provided there didn't work for me.
sklearn RandomForestClassifier's class_weights seems to have no effect
Is there anything that I'm missing?
Thanks!
The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
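For illustration, a minimal sketch of the kind of change described above; the parameter values are arbitrary, and X_train / y_train / X_test are placeholders for your own data:
from sklearn.ensemble import RandomForestClassifier

# Limit tree depth so the leaf nodes are no longer pure
rf = RandomForestClassifier(n_estimators=200, max_depth=10,
                            class_weight={0: 0.5, 1: 2.75}, random_state=42)
rf.fit(X_train, y_train)

# With non-pure leaves, the class weights visibly shift the predicted probabilities
probs = rf.predict_proba(X_test)[:, 1]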

Sklearn GaussianMixture

I have been teaching myself artificial intelligence for several months through a project on handwritten character recognition and transcription. So far I have successfully used Keras, Theano and TensorFlow, implementing CNN and CTC neural networks.
Now I am trying to use Gaussian mixture models as a first step towards hidden Markov models with Gaussian emissions. To do so, I used sklearn.mixture with PCA reduction, selecting the best model with the Akaike and Bayesian information criteria: covariance type 'full' for AIC, which gives a nice U-shaped curve, and 'tied' for BIC, because with full covariance BIC gives just a linear curve. With 12,000 samples, I get the best model at 60 components for AIC and 120 components for BIC.
My input images are 64 pixels per side and represent only the capital letters of the English alphabet, i.e. 26 categories numbered from 0 to 25.
The fit method of sklearn's GaussianMixture ignores the labels, and the predict method returns the index of the most probable component (0 to 59 or 0 to 119) among the n_components.
How can I retrieve the original label (the position of the character in a list) using sklearn's GaussianMixture?
So, you want to use GaussianMixture as a generative classifier. You need to compute P(Y|X) for each label and estimate the label according to these probabilities. To do so, keep one GMM per label and train it only on data from the corresponding label. The score method will then give you the likelihood P(X|Y) of the given data (or rather the log-likelihood, so you may want to check that). If you multiply the likelihood by the prior, you get the posterior P(Y|X) (up to a normalising constant). For each label you get a posterior, e.g. P(Y=0|X), P(Y=1|X), ..., and the label with the maximum posterior probability can be reported as the estimated label.
You can get some hints from the code sample below. (Here it is assumed that the prior probabilities are equal; you need to account for them in your own implementation.)
import numpy as np
from sklearn.mixture import GaussianMixture

# one GMM per class (10 classes in this example)
score = np.empty((X_test.shape[0], 10))
predictor_list = []
for i in range(10):
    predictor = GaussianMixture()
    predictor.fit(X[Y == i])                        # train only on samples of class i
    predictor_list.append(predictor)
    score[:, i] = predictor.score_samples(X_test)   # per-sample log-likelihood log P(X|Y=i)

Y_predicted = np.argmax(score, axis=1)              # class with the highest likelihood
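To account for unequal priors, one option is to add the log-prior of each class to the per-class log-likelihoods before taking the argmax. A minimal sketch, under the assumption that the priors are estimated from the training label frequencies (Y is the training label array from above):
# estimated prior P(Y=i) from the training labels
log_priors = np.array([np.log(np.mean(Y == i)) for i in range(10)])

# log posterior (up to a constant): log P(X|Y=i) + log P(Y=i)
Y_predicted = np.argmax(score + log_priors, axis=1)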

Class_Weight in Random Forest Python

I am currently trying to vary the threshold of a Random Forest Classifier in order to plot a ROC curve. I was under the impression that the only way to do this for a Random Forest is through the use of the class_weight parameter. I have been able to do this successfully, increasing and decreasing precision, recall, true positive and false positive rates; however, I am not sure what I am actually doing. Currently I have the following:
rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, n_estimators=50, max_depth=40, min_samples_split=100, min_samples_leaf=80, class_weight={0: .4, 1: .9})
What are the .4 and .9 actually referring to? I thought it meant that 40% of the data set is 0s and 90% is 1s; however, that obviously makes no sense (over 100%). What is it actually doing? Thanks!
Class weights do not need to sum to 1; only the ratio of the class weights matters (so requiring them to sum to 1 would not actually be a restriction anyway).
So setting the class weights to 0.4 and 0.9 is equivalent to assuming a split of class labels in the data of 0.4 / (0.4 + 0.9) to 0.9 / (0.4 + 0.9), i.e. roughly 30% belonging to class 0 and 70% belonging to class 1.
An alternative way to view differing class weights is as a way of penalising mistakes in one class more strongly than in the other, while still assuming balanced numbers of labels in the data. In your example, it would be 9/4 times worse to misclassify a 1 as a 0 than to misclassify a 0 as a 1.
The easiest (in my experience) way to vary the discrimination threshold of any of the scikit-learn classifiers is to use the predict_proba() function. Rather than returning a single output class, this returns the probabilities for membership in each class (concretely what it is doing is outputting the proportion of samples in the leaf nodes reached during the classification, averaged over all trees in the random forest.) Once you have these probabilities, it is easy to implement your own final classification step by comparing the probability for each class to some threshold which you can change.
probs = RF.predict_proba(X)  # output dimensions: [num_samples x num_classes]
for threshold in range(0, 100):
    threshold = threshold / 100.0
    classes = (probs > threshold).astype(int)
    # further analysis here as desired
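Since the original goal was an ROC curve, note that scikit-learn can also do the threshold sweep for you. A minimal sketch using the positive-class column of predict_proba; y here is a placeholder for the true labels of X:
from sklearn.metrics import roc_curve, auc

pos_probs = RF.predict_proba(X)[:, 1]           # probability of class 1 for each sample
fpr, tpr, thresholds = roc_curve(y, pos_probs)  # one (FPR, TPR) point per candidate threshold
print(auc(fpr, tpr))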

How to get the most valuable training data under the TensorFlow framework

Suppose I want to add more training data to an existing classification model. Since the cost of labeling training data is high, I only want to label the data that is most valuable to the existing model.
For example, say we only have two classes (A/B) in our classification problem, and we use the existing model to predict three unlabeled samples, getting the following probability distributions:
Data                  A     B
Case 1: features ->   0.9   0.1
Case 2: features ->   0.6   0.4
Case 3: features ->   0.5   0.5
Case 3 should be the most valuable training data, since the current model doesn't know which class it belongs to. Is that right? If so, entropy should be a good metric here, but I just can't find a tf.reduce_entropy implementation in TensorFlow.
Can't you use the scipy implementation for entropy? https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html
scipy.stats.entropy(pk)
You can get the predictions for your unlabeled data, then calculate entropy for each of the predictions.
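For instance, a minimal sketch assuming probs is an array of per-sample class probabilities like the table above (higher entropy = more uncertain, so the most informative sample to label):
import numpy as np
from scipy.stats import entropy

probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.5, 0.5]])

# Shannon entropy of each predicted distribution (one row per sample)
uncertainties = np.array([entropy(p) for p in probs])
most_informative = np.argmax(uncertainties)  # -> 2, i.e. Case 3
print(uncertainties, most_informative)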
Hope this helps!
