Class weights vs under/oversampling - python

In imbalanced classification (with scikit-learn), what is the difference between balancing the classes (i.e. setting class_weight to balanced) and oversampling, with SMOTE for example?
What would be the expected effects of one vs the other?

Class weights directly modify the loss function by giving more (or less) penalty to the classes with more (or less) weight. In effect, you sacrifice some ability to predict the lower-weighted class (the majority class in an unbalanced dataset) by deliberately biasing the model towards more accurate predictions of the higher-weighted class (the minority class).
Oversampling and undersampling methods essentially give more weight to particular classes as well (duplicating observations duplicates the penalty for those observations, giving them more influence in the model fit), but because of the data splitting that typically takes place during training, this will yield slightly different results.
Please refer to https://datascience.stackexchange.com/questions/52627/why-class-weight-is-outperforming-oversampling
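For concreteness, here is a minimal sketch of the two approaches side by side; the synthetic dataset and logistic regression model are illustrative stand-ins, not part of the original question, and the SMOTE variant assumes the imbalanced-learn package.

```python
# Option 1: reweight the loss via class_weight.
# Option 2: synthesize extra minority samples with SMOTE, keep the loss as is.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE during fit only, never at prediction time

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train, y_train)

oversampled = Pipeline([("smote", SMOTE(random_state=0)),
                        ("clf", LogisticRegression(max_iter=1000))])
oversampled.fit(X_train, y_train)

print(classification_report(y_test, weighted.predict(X_test)))
print(classification_report(y_test, oversampled.predict(X_test)))
```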

Related

Is it necessary to mitigate class imbalance problem in multiclass text classification?

I am performing multi-class text classification using BERT in Python. The dataset that I am using for retraining my model is highly imbalanced. Now, I am well aware that class imbalance leads to a poor model and that one should balance the training set by undersampling, oversampling, etc. before training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., the samples to be classified will more likely belong to some classes than to others, should I still balance my training set?
OR
Should I keep the training set as it is as I know that the distribution of the training set is similar to the distribution of data that I will encounter in the production?
Please give me some ideas, or provide some blogs or papers for understanding this problem.
Class imbalance is not a problem by itself; the problem is that having too few samples of the minority class makes it harder to describe its statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions, IIRC).
Additionally, the logistic function tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by selecting a classification threshold as well as by resampling.
There are quite a few discussions on CrossValidated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
While having too few samples of a class may degrade prediction quality, resampling is not guaranteed to give an overall improvement; at the very least, there is no universal recipe for a perfect resampling proportion, so you will have to test it out for yourself.
However, real-life tasks often weigh classification errors unequally: resampling may help improve a particular class's metrics at the cost of overall accuracy. The same applies to classification threshold selection, though.
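To make the threshold-selection point concrete, here is a rough sketch; the synthetic data, the logistic regression model, and the 0.2 threshold are placeholders rather than anything from the question, and in practice the threshold would be tuned on a validation set against the metric you care about.

```python
# Lowering the decision threshold instead of resampling the training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]                     # estimated P(rare class | sample)

print(recall_score(y_val, (proba >= 0.5).astype(int)))     # default threshold
print(recall_score(y_val, (proba >= 0.2).astype(int)))     # lower threshold favours the rare class
```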
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> Balance your training set, or apply weighting during training that increases the weights of the rare classes (a small sketch of computing such weights follows below).
For example, in client-facing web applications it is important that most samples are classified correctly, even at the expense of rare classes, whereas in anomaly detection/classification it is very important that rare classes are classified correctly.
Keep in mind that a model trained on a highly imbalanced dataset tends to always predict the majority class, so increasing the number or the weights of rare-class samples can be a good idea, even without perfectly balancing the training set.
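As a small illustration of the weighting option mentioned above: the label counts below are made up and the helper comes from scikit-learn; the resulting dictionary can be passed to most training APIs that accept per-class weights.

```python
# Derive per-class weights that up-weight rare classes.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 80 + [2] * 20)   # made-up imbalanced labels
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)   # the rarest class (2) gets the largest weight
```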
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with a large parameter space, rare labels leave only a small footprint on the training, so your model ends up fitting P(label).
To avoid fitting to P(label), you can balance the batches.
Over all the batches of an epoch, the data then looks as if the minority class had been up-sampled. The goal is a loss function whose gradients move the parameters toward a better classification goal.
UPDATE
I don't have any proof to show here, and this may not be an entirely accurate statement. With enough training data (relative to the complexity of the features) and enough training steps, you may not need balancing. But most language tasks are quite complex and there is rarely enough training data; that was the situation I had in mind in the statements above.
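A minimal, model-agnostic sketch of the balanced-batch idea described above; the array names, batch size, and sampling-with-replacement choice are illustrative assumptions, not something prescribed by the answer.

```python
# Each mini-batch draws (roughly) equally from every class, so rare labels
# keep a footprint in every gradient step.
import numpy as np

def balanced_batches(X, y, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    idx_by_class = {c: np.where(y == c)[0] for c in classes}
    while True:  # infinite generator; stop after the desired number of steps
        batch_idx = np.concatenate([
            rng.choice(idx_by_class[c], size=per_class, replace=True)
            for c in classes
        ])
        rng.shuffle(batch_idx)
        yield X[batch_idx], y[batch_idx]
```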

How does Keras use the class_weight parameter?

Keras uses a class_weight parameter to deal with imbalanced datasets.
Here is what we can find in the doc:
Optional dictionary mapping class indices (integers) to a weight (float) to apply to the model's loss for the samples from this class during training. This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
Does that mean that class_weight gives a different weight to each class in the training loss function? Does it have an influence elsewhere? Is it really effective against generalization error, compared with "physically" dropping instances from the most represented class?
The class_weight parameter weights the loss associated with each training example in proportion to that class's under-representation in the training set. This counteracts class imbalance during training and should make your network more robust to generalization error.
I'd exercise caution when physically dropping data instances corresponding to the most represented class, however: if your network is deep and therefore has significant representational capacity, culling your dataset can lead to overfitting, and consequently poor generalization to the validation/test sets.
I would recommend using the class_weight parameter as specified in the Keras documentation. If you really are intent on dropping data instances from the most represented class, ensure that you tune your network topology to decrease the model's representational capacity (i.e. add Dropout and/or L2 regularization).
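For reference, a minimal sketch of passing class_weight to model.fit; the tiny model, the random data, and the 1:5 weighting are illustrative choices, not recommendations from the answer.

```python
# Each positive example contributes 5x as much to the loss as a negative one.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(1000, 20)
y = (np.random.rand(1000) < 0.1).astype(int)    # roughly 10% positives, made up

model.fit(X, y, epochs=3, class_weight={0: 1.0, 1: 5.0}, verbose=0)
```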

How does H2O weight base learners when a stacked ensemble is applied?

How does H2O determine the weights for the base learners? In the example here, for instance, are all the base learners weighted equally? And is there a way to use regularization parameters (e.g. ridge) in the metalearner_algorithm? What would be the best way to avoid overfitting?
The main idea of stacked ensemble (and the thing that differentiates it from other types of ensemble, such as random forest, GBMs, simple averaging of confidences) is that it uses another machine learning model to determine how to weight the base learners. (This other model is the meta-learner.)
For your second question, you currently cannot specify any parameters, but there is a ticket for it, so there is a fair chance it will be available in the next few months.
In the meantime, I would say that paying attention to overfitting in the base models is more important than regularization in the meta-learner.
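For context, a minimal sketch of a stacked ensemble with a GLM meta-learner, assuming the standard H2O Python API; the file path, column layout, and choice of base models are placeholders, not taken from the question.

```python
# Base models must share folds and keep their cross-validation predictions so
# the meta-learner (here a GLM) can be trained on them.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

h2o.init()
train = h2o.import_file("train.csv")              # placeholder path
x, y = train.columns[:-1], train.columns[-1]
train[y] = train[y].asfactor()

common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)
gbm = H2OGradientBoostingEstimator(**common)
gbm.train(x=x, y=y, training_frame=train)
drf = H2ORandomForestEstimator(**common)
drf.train(x=x, y=y, training_frame=train)

# The meta-learner learns how to weight the base learners.
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, drf],
                                       metalearner_algorithm="glm")
ensemble.train(x=x, y=y, training_frame=train)
```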

Classification: What happens if one class has 4 times as much data as the other class?

I am trying to debug an issue with my classifier. The issue is that it always predicts the same class for a given input despite having close to an 80% accuracy.
I trained my CNN to detect the difference between 2 classes. class A has 2575 jpegs and class B has 665 jpegs.
Could this have caused my issue with my CNN always predicting the same class? Is this too much of an imbalance between the number of items in each class? In general, will my performance improve if I make both classes the same size (at 665 jpegs)?
The problem seems to be a case of class imbalance, and there are different ways to handle it:
Weighted loss: you can down-weight the majority class's contribution to the loss by computing a weighted cross-entropy.
Resampling the data: as you mentioned, you can downsample the majority class to balance the classes, or upsample the minority class to even them out.
Generate augmented data: since you are handling images, you can upsample the minority class and then apply data augmentation to those images; this addresses the class imbalance while also tackling overfitting and improving generalisation (see the sketch after this list).
Or a combination of all of the above.
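A rough sketch of the augmentation option; the array names, augmentation settings, and helper function are illustrative assumptions rather than anything specified in the answer.

```python
# Generate augmented copies of minority-class images until the classes match.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)

def augment_minority(x_minority, n_needed, batch_size=32):
    """Draw random augmented copies of minority-class images until n_needed exist."""
    copies, total = [], 0
    for batch in augmenter.flow(x_minority, batch_size=batch_size, shuffle=True):
        copies.append(batch)
        total += len(batch)
        if total >= n_needed:
            break
    return np.concatenate(copies)[:n_needed]

# e.g. with 2575 class-A and 665 class-B images:
# x_b_extra = augment_minority(x_b, 2575 - 665)   # x_b: (665, H, W, 3) array
```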

How to interpret feature importance for ensemble methods?

I'm using ensemble methods (random forest, xgbclassifier, etc) for classification.
One important output is the feature importance estimate, which looks like this:
Feature     Importance
Feature-A   0.25
Feature-B   0.09
Feature-C   0.08
.......
This model achieves an accuracy score of around 0.85; Feature-A is clearly the dominant feature, so I decided to remove it and recalculate.
However, after removing Feature-A, I still get good performance, with accuracy around 0.79.
This doesn't make sense to me: if Feature-A contributes 25% of the importance, why is the accuracy barely affected when it is removed?
I know ensemble methods have the advantage of combining 'weak' features into 'strong' ones; is the accuracy score therefore mostly a matter of aggregation, and less sensitive to the removal of an important feature?
Thanks
It's possible there are other features that are redundant with Feature A. For instance, suppose that features G,H,I are redundant with feature A: if you know the value of features G,H,I, then the value of feature A is pretty much determined.
That would be consistent with your results. If we include feature A, the model will learn to use it, as it's very simple to get excellent accuracy using just feature A and ignoring features G,H,I; so it will have excellent accuracy, high importance for feature A, and low importance for features G,H,I. If we exclude feature A, the model can still get almost-as-good accuracy by using features G,H,I, so it will still have very good accuracy (though the model might become more complicated, because the relationship between G,H,I and the class is more complicated than the relationship between A and the class).
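A small sketch of how one might check for this kind of redundancy; the synthetic data and random-forest models are stand-ins for whatever the original features and estimators were.

```python
# (1) Compare accuracy with and without the dominant feature.
# (2) See how well the remaining features can predict the dominant feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_redundant=3, random_state=0)
a = 0                                    # index standing in for "Feature-A"
X_without_a = np.delete(X, a, axis=1)

clf = RandomForestClassifier(random_state=0)
print(cross_val_score(clf, X, y).mean())            # accuracy with Feature-A
print(cross_val_score(clf, X_without_a, y).mean())  # accuracy without it

reg = RandomForestRegressor(random_state=0)
print(cross_val_score(reg, X_without_a, X[:, a]).mean())  # R^2 for predicting A from the rest
```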
