I'm new to ML and I've been working with an imbalanced data set where the count of negative samples is twice that of the positive samples. To address this, I set scikit-learn's random forest class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and a recall of 0.86 for class 1. When I tried to further improve the AUC score by assigning explicit weights, e.g. class_weight = {0: 0.5, 1: 2.75}, there wasn't any major difference in the results. I assumed this would penalize every wrong classification of class 1 more heavily, but it didn't seem to work as expected.
randomForestClf = RandomForestClassifier(random_state = 42, class_weight = {0: 0.5, 1:2.75})
I tried different values, but they had no major impact: the recall for class 1 stays the same or drops (0.85), and the change in AUC is insignificant (0.90122). It only seems to have an effect when the weight of one of the labels is set to 0.
I also tried setting sample weights, but that didn't seem to work either.
# Sample weights
class_weights = [0.5, 2]
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
    weights[i] = class_weights[val]
Below is the reference to a similar question but the solutions provided didn't work for me.
sklearn RandomForestClassifier's class_weights seems to have no effect
Is there anything that I'm missing?
Thanks!
The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
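The effect described above can be checked with a small experiment. This is only a sketch: the synthetic data from make_classification, the 100-tree forests and the depth limit of 5 are assumptions chosen for illustration, not the asker's setup. It fits forests with and without a depth limit, each with equal weights and with the weights from the question, and reports the mean predicted probability, the recall and the AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 2:1 negative/positive ratio (an assumption
# standing in for the real data set).
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.67, 0.33], flip_y=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)


def evaluate(max_depth, class_weight):
    """Fit a forest and summarize its predictions on the held-out set."""
    clf = RandomForestClassifier(n_estimators=100, max_depth=max_depth,
                                 class_weight=class_weight, random_state=42)
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    return {
        "mean_proba": proba.mean(),
        "recall": recall_score(y_test, clf.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }


weightings = {"equal": {0: 1.0, 1: 1.0}, "heavy": {0: 0.5, 1: 2.75}}
results = {}
for max_depth in (None, 5):  # None grows every tree out fully
    for name, class_weight in weightings.items():
        results[(max_depth, name)] = evaluate(max_depth, class_weight)
        print(max_depth, name, results[(max_depth, name)])
```

With the depth limit in place, the heavier positive-class weight should push the mean predicted probability (and usually the recall) up, while the AUC should move much less.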
I am trying to build a classifier with LightGBM on a very imbalanced dataset. Imbalance is in the ratio 97:3, i.e.:
Class
0 0.970691
1 0.029309
The parameters I used and the training code are shown below.
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric':'auc',
    'learning_rate': 0.1,
    'is_unbalance': 'true', #because training data is unbalance (replaced with scale_pos_weight)
    'num_leaves': 31, # we should let it be smaller than 2^(max_depth)
    'max_depth': 6, # -1 means no limit
    'subsample' : 0.78
}
# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10,
                    verbose_eval=10, early_stopping_rounds=40)
nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)
model = lgb.train(lgb_params, dtrain, num_boost_round=nround)
preds = model.predict(test_feats)
preds = [1 if x >= 0.5 else 0 for x in preds]
I ran CV to get the best model and the best round. I got 0.994 AUC in CV and a similar score on the validation set.
But when I predict on the test set, I get very bad results. I am sure that the train set is sampled perfectly.
Which parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?
The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in
preds = [1 if x >= 0.5 else 0 for x in preds]
This should not be the case here.
This is a rather big topic, and I strongly suggest you do your own research (try googling for threshold or cut off probability imbalanced data), but here are some pointers to get you started...
From a relevant answer at Cross Validated (emphasis added):
Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:
2.2. How to set the classification threshold for the testing set
Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.
Take home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...
On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
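As a rough illustration of picking a threshold from the ROC curve rather than using 0.5, here is a minimal sketch. Several things in it are assumptions made for the example and do not come from the question: the synthetic 97:3 data, the use of scikit-learn's LogisticRegression in place of LightGBM, and the choice of Youden's J statistic (TPR minus FPR) as the selection criterion. Other criteria, such as a cost-based one, may suit a given problem better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data with about 3% positives (an assumption mirroring the 97:3 ratio).
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Any model that outputs probabilities would do; this one is just quick to fit.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]

# Choose the threshold on held-out data, never on the final test set.
fpr, tpr, thresholds = roc_curve(y_val, val_proba)
best_index = np.argmax(tpr - fpr)  # Youden's J = TPR - FPR
best_threshold = thresholds[best_index]

recall_default = recall_score(y_val, (val_proba >= 0.5).astype(int))
recall_tuned = recall_score(y_val, (val_proba >= best_threshold).astype(int))
print("chosen threshold:", best_threshold)
print("recall at 0.5:", recall_default, "recall at chosen threshold:", recall_tuned)
```

The chosen threshold would then replace the hard-coded 0.5 when turning test-set probabilities into class labels.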
The training dataset contains two classes, A and B, which we represent as 1 and 0 respectively in our target labels. Our label data is heavily skewed towards class 0, which takes up roughly 95% of the data, while class 1 is only 5%. How should we construct our loss function in such a case?
I found Tensorflow has a function that can be used with weights:
tf.losses.sigmoid_cross_entropy
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value.
Sounds good. I set weights to 2.0 to make loss higher and punish errors more.
loss = loss_fn(targets, cell_outputs, weights=2.0, label_smoothing=0)
However, not only did the loss not go down, it increased, and the final accuracy on the dataset decreased slightly. OK, maybe I misunderstood and it should be < 1.0, so I tried a smaller number. This didn't change anything; I got almost the same loss and accuracy. O_o
Needless to say, the same network trained on the same dataset with a loss weight of 0.3 reduces the loss significantly, by up to 10x, in Torch / PyTorch.
Can somebody please explain how to use loss weights in Tensorflow?
If you're scaling the loss with a scalar, like 2.0, then basically you're multiplying the loss and therefore the gradient for backpropagation. It's similar to increasing the learning rate, but not exactly the same, because you're also changing the ratio to regularization losses such as weight decay.
If your classes are heavily skewed, and you want to balance it at the calculation of loss, then you have to specify a tensor as weight, as described in the manual for tf.losses.sigmoid_cross_entropy():
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding losses dimension).
That is, make the weights tensor 1.0 for class 0 and maybe 10 for class 1, and now "false negative" losses will be counted much more heavily.
It is an art how much you should over-weigh the underrepresented class. If you overdo it, the model will collapse and will predict the over-weighted class all the time.
An alternative to achieve the same thing is using tf.nn.weighted_cross_entropy_with_logits(), which has a pos_weight argument for the exact same purpose. But it's in tf.nn not tf.losses so you have to manually add it to the losses collection.
Generally another method to handle this is to arbitrarily increase the proportion of the underrepresented class at sampling. That should not be overdone either, however. You can do both of these things too.
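To make the difference between a scalar weight and a per-class weights tensor concrete, here is a small sketch in plain NumPy rather than TensorFlow, so that only the arithmetic is shown. The toy labels and logits are made up, and the 1.0 / 10.0 weights are the ones suggested above; the helper function is a hand-written stand-in for a sigmoid cross-entropy loss, not the TensorFlow implementation.

```python
import numpy as np


def sigmoid_cross_entropy(labels, logits):
    """Per-example sigmoid cross entropy in a numerically stable form."""
    return (np.maximum(logits, 0.0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))


# Toy batch: four majority-class examples (0) and one minority-class example (1).
labels = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
logits = np.array([-2.0, -1.5, -3.0, 0.5, -1.0])

per_example = sigmoid_cross_entropy(labels, logits)

# A scalar weight multiplies every example by the same factor.
scalar_weighted = 2.0 * per_example

# A weights tensor built from the labels: 1.0 for class 0, 10.0 for class 1.
class_weights = np.where(labels == 1.0, 10.0, 1.0)
class_weighted = class_weights * per_example


def minority_share(losses):
    """Fraction of the total loss contributed by the minority class."""
    return losses[labels == 1.0].sum() / losses.sum()


share_plain = minority_share(per_example)
share_scalar = minority_share(scalar_weighted)
share_class = minority_share(class_weighted)
print(share_plain, share_scalar, share_class)
```

The scalar leaves the minority class's share of the loss unchanged, which is why it behaves like a learning-rate change, while the per-class weights shift the balance toward the minority class.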
You can set a penalty for misclassification of each sample. If weights is a tensor of shape [batch_size], the loss for each sample will be multiplied by the corresponding weight. So if you assign the same weight to all samples (which is the same as using a scalar weight), your loss will only be scaled by this scalar, and the accuracy should not change.
If you instead assign different weights for the minority class and the majority class, the contributions of the samples to the loss function will be different, and you should be able to influence the accuracy by choosing your weights differently.
A few scenarios (your choice will depend on what you need):
1.) If you want good overall accuracy, you could choose the weights of the majority class to be very large and the weights of the minority class much smaller. This will probably lead to all events being classified into the majority class (i.e. 95% total classification accuracy), but the minority class will usually be classified into the wrong class.
2.) If your signal is the minority class and the background is the majority class, you probably want very little background contamination in your predicted signal, i.e. you want almost no background samples to be predicted as signal. This will also happen if you choose the majority weight much larger than the minority weight, but you might find that the network tends to predict all samples to be background. So you will not have any signal samples left.
In this case you should consider a large weight for the minority class + an extra loss for background samples being classified as signal samples (false positives), like this:
loss = weighted_cross_entropy + extra_penalty_for_false_positives
I am currently trying to vary the threshold of a Random Forest Classifier in order to plot a ROC Curve. I was under the impression that the only way to do this for a Random Forest is through the use of the class_weight parameter. I have been able to do this successfully, increasing and decreasing precision, recall, true positive and false positive rates; however, I am not sure what I am actually doing. Currently I have the following;
rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, n_estimators=50,max_depth=40,min_samples_split=100,min_samples_leaf=80, class_weight={0:.4, 1:.9})
What do the .4 and .9 actually refer to? I thought it meant that 40% of the data set is 0's and 90% is 1's; however, this obviously makes no sense (over 100%). What is it actually doing? Thanks!
Class weights typically do not need to normalise to 1 (it's only the ratio of the class weights that is important, so demanding that they sum to 1 would not actually be a restriction though).
So setting the class weights to 0.4 and 0.9 is equivalent to assuming a split of class labels in the data of 0.4 / (0.4+0.9) to 0.9 / (0.4+0.9) [roughly ~30% belonging to class 0 and ~70% belonging to class 1].
An alternative way to view differing class weights is as a way of more strongly penalising mistakes in one class compared to another, but still assuming balanced numbers of labelings in the data. In your example, it would be 9/4 times worse to misclassify a 1 as a 0 than it would be to misclassify a 0 as a 1.
The easiest (in my experience) way to vary the discrimination threshold of any of the scikit-learn classifiers is to use the predict_proba() function. Rather than returning a single output class, this returns the probabilities for membership in each class (concretely what it is doing is outputting the proportion of samples in the leaf nodes reached during the classification, averaged over all trees in the random forest.) Once you have these probabilities, it is easy to implement your own final classification step by comparing the probability for each class to some threshold which you can change.
probs = RF.predict_proba(X) # output dimension: [num_samples x num_classes]
for threshold in range(0,100):
    threshold = threshold / 100.0
    classes = (probs > threshold).astype(int)
    # further analysis here as desired
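For the ROC curve itself, the threshold sweep can be applied to the positive-class column of predict_proba, recording the true and false positive rates at each step. The sketch below is self-contained under assumptions of its own: the synthetic data, the forest settings and the 101 evenly spaced thresholds are illustrative choices. scikit-learn's sklearn.metrics.roc_curve computes the same kind of curve directly if you prefer not to loop by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data and model (assumptions, not the asker's setup).
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)

# Probability of class 1 for each test sample.
p1 = rf.predict_proba(X_test)[:, 1]

is_pos = y_test == 1
is_neg = ~is_pos
thresholds = np.linspace(0.0, 1.0, 101)
tpr, fpr = [], []
for t in thresholds:
    pred_pos = p1 >= t  # classify as 1 when the probability reaches the threshold
    tpr.append(np.sum(pred_pos & is_pos) / np.sum(is_pos))
    fpr.append(np.sum(pred_pos & is_neg) / np.sum(is_neg))
tpr, fpr = np.array(tpr), np.array(fpr)

# Each (fpr[i], tpr[i]) pair is one point on the ROC curve.
print(list(zip(fpr[:3], tpr[:3])))
```

Plotting tpr against fpr then gives the curve without touching class_weight at all.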
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)
    # Define a function to calculate the error given a weight vector beta and a training example xi, yi
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))
    # Find dimensions of train
    m, d = X.shape
    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        # Build a list: shuffle() cannot shuffle a range object in Python 3
        indices = list(range(0, m))
        shuffle(indices)
        for i in indices: # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i] # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1
The data I'm testing this on is not normalized and my implementation always ends up with all the weights being Infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value ~1e-8, the algorithm ends up with valid values of the weights vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
Your assumption is not correct.
It's hard to answer this, because there are so many different methods/environments, but I will try to mention some points.
Normalization
When some method is not scale-invariant (I don't think any linear regression is), you really should normalize your data
I take it that you are just ignoring this because of debugging / analyzing
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function; big values might contribute much more to the loss than small ones)!
Convergence
There is probably much to tell about the convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex optimization problem) for some choices of hyperparameters (learning rate and learning schedule/decay)
Even optimizing normalized data can fail with SGD when those params are bad!
This is one of the most important downsides of SGD: dependence on hyperparameters
As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on whether this convergence is achieved!
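The link between feature scale and the usable step size can be seen in a small numerical experiment. The sketch below swaps in full-batch gradient descent for SGD, because its stability condition is easy to state exactly: on the averaged ridge objective it converges only if the step is below 2 divided by the largest eigenvalue of X^T X / m + lambda * I. SGD is noisier, but it is exposed to the same scale problem. The data, the penalty lam = 0.1 and the step alpha = 0.1 are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
# Intercept column, one feature on a scale of ~1 and one on a scale of ~1000.
X_raw = np.column_stack([np.ones(m),
                         rng.normal(0.0, 1.0, m),
                         rng.normal(0.0, 1000.0, m)])
y = X_raw @ np.array([1.0, 2.0, 0.003]) + rng.normal(0.0, 0.1, m)

# Standardize every column except the intercept.
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(axis=0)) / X_raw[:, 1:].std(axis=0)


def max_stable_step(X, lam):
    """Largest step for which batch gradient descent on this objective is stable."""
    hessian = X.T @ X / len(X) + lam * np.eye(X.shape[1])
    return 2.0 / np.linalg.eigvalsh(hessian).max()


def run_gd(X, y, lam, alpha, n_steps=200):
    """Plain batch gradient descent on the averaged ridge objective."""
    beta = np.zeros(X.shape[1])
    with np.errstate(all="ignore"):  # silence overflow warnings when it diverges
        for _ in range(n_steps):
            grad = X.T @ (X @ beta - y) / len(X) + lam * beta
            beta = beta - alpha * grad
    return beta


lam, alpha = 0.1, 0.1
step_raw, step_std = max_stable_step(X_raw, lam), max_stable_step(X_std, lam)
beta_raw = run_gd(X_raw, y, lam, alpha)
beta_std = run_gd(X_std, y, lam, alpha)
print("max stable step (raw):", step_raw, "(standardized):", step_std)
print("raw finite:", np.isfinite(beta_raw).all(),
      "standardized finite:", np.isfinite(beta_std).all())
```

With the unscaled feature, the same step that works on standardized data blows the weights up to infinity, which matches the behaviour described in the question.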
In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the Convergence and stability in the mean section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). In the case of ridge regression, it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, it might require smaller step sizes or allow for larger step sizes. It all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether that fits your needs. Sometimes it makes sense to normalize, but sometimes it doesn't (for example, when you want to give different importance to different samples, or when you think that a larger signal energy means better SNR).