Keras uses a class_weight parameter to deal with imbalanced datasets.
Here is what we can find in the doc:
Optional dictionary mapping class indices (integers) to a weight (float) to apply to the model's loss for the samples from this class during training. This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
Does that mean that class_weight gives a different weight to each class in the training loss function? Does it have an influence anywhere else? Is it really effective against generalization error, compared with "physically" dropping instances from the most represented class?
The class_weight parameter weights the loss associated with each training example in proportion to how under-represented its class is in the training set. This counteracts the class imbalance during training and should help your network generalize better.
I'd exercise caution when physically dropping data instances corresponding to the most represented class, however: if your network is deep and therefore has significant representational capacity, culling your dataset can lead to overfitting and, consequently, poor generalization to the validation/test sets.
I would recommend using the class_weight parameter as specified in the Keras documentation. If you really are intent on dropping data instances from the most represented class, ensure that you tune your network topology to decrease the model's representational capacity (e.g. add Dropout layers and/or L2 regularization).
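For illustration, a minimal sketch of what that looks like in practice (the 1:10 ratio and the model/data names below are placeholders, not taken from the question):
# Hypothetical 1:10 imbalance - make errors on the rare class 1 cost 10x as much in the loss
class_weight = {0: 1.0, 1: 10.0}
model.fit(X_train, y_train, epochs=10, class_weight=class_weight)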
Related
I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn's compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) manually setting class weights to extreme values to see if anything changes at all, such as {0: 0.5, 1: 100, 2: 200}. But in no case does it help the classifier take the minority classes into account.
Observations:
I can handle the problem in the binary case: if I turn the problem into a binary classification by merging classes [1,2], then I can get the classifier to work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, works for binary classification. Is there an analogue of this parameter for the multi-classification problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques, such as over/undersampling or SMOTE. But I want to avoid them as much as possible, and prefer a solution based on hyperparameter tuning of the model if possible.
My observation above shows that this can work for the binary case.
The sample_weight parameter is useful for handling imbalanced data when training with XGBoost. You can compute the sample weights using compute_sample_weight() from the sklearn library.
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target name
)
xgb_classifier.fit(X, y, sample_weight=sample_weights)
You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the parameter to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)
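For example, a rough sketch of what computing your own weights might look like (the per-class values here are made up; with 12 classes you would extend the dictionary accordingly):
import numpy as np

# Hypothetical hand-tuned per-class weights - tune these to your own data
custom_class_weights = {0: 1.0, 1: 8.0, 2: 15.0}

# Map each training label to its class weight to obtain per-sample weights
weights = np.array([custom_class_weights[label] for label in y_train])

xgb_class.fit(X_train, y_train, sample_weight=weights)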
I have gathered a training dataset for my network model, but unfortunately it is critically imbalanced (two classes: one with 2,000 samples, the other with 15,000). Is there a way to balance the data using the Keras library without having to balance it manually? I don't want to use upsampling or downsampling because I don't want to run into overfitting or underfitting problems.
There are a number of ways and best practices to deal with so-called imbalanced datasets.
Upsample the minority class (Drawback: possibly overfitting of minority class)
Downsample the majority class (Drawback: loss of training data, information loss)
There are a number of techniques you can use for this, and some even offer methods to overcome the drawbacks (e.g. synthetic sampling). Have a look at the imbalanced-learn package for an easy-to-use implementation.
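As a quick illustration, a minimal sketch of synthetic oversampling with imbalanced-learn's SMOTE (X_train and y_train are placeholders for your own arrays):
from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class samples until the classes are balanced
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)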
Another option is to weight your model's loss in order to tell the model that it should "pay more attention" to specific classes. This can easily be done via the optional class_weight argument of Keras' fit function. The class weights themselves can be computed with sklearn's compute_class_weight function.
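A rough sketch of that approach, assuming a compiled model and integer labels in y_train:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights each class by n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

model.fit(X_train, y_train, epochs=10, class_weight=class_weight)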
In imbalanced classification (with scikit-learn), what would be the difference between balancing classes (i.e. setting class_weight to 'balanced') and oversampling, with SMOTE for example?
What would be the expected effects of one vs the other?
Class weights directly modify the loss function by giving more (or less) penalty to the classes with more (or less) weight. In effect, one is basically sacrificing some ability to predict the lower-weighted class (the majority class for unbalanced datasets) by purposely biasing the model towards more accurate predictions of the higher-weighted class (the minority class).
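Concretely, with a per-class weight $w_c$ the standard cross-entropy loss becomes (a generic formulation, not tied to any particular library):
$$L = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log p_\theta(y_i \mid x_i)$$
so a misclassified sample from a heavily weighted class contributes more to the loss than one from a lightly weighted class.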
Oversampling and undersampling methods essentially give more weight to particular classes as well (duplicating observations duplicates the penalty for those particular observations, giving them more influence in the model fit), but due to the data splitting that typically takes place during training, this will yield slightly different results.
Please refer to https://datascience.stackexchange.com/questions/52627/why-class-weight-is-outperforming-oversampling
This is a continuation of a question:
https://datascience.stackexchange.com/questions/22814/class-weighting-during-validation-in-keras
class_weight can be used in Keras' fit function to tell the optimizer to weight the under-represented class. According to the answer on Stack Exchange, it is also taken into account during validation. For example, if my class ratio is 10 negatives for every positive, then an accuracy score of 0.8 is not very good (a classifier that always predicts the negative class would already reach about 0.91). I have two questions:
How exactly is class_weight taken into account during validation?
How can I use class_weight in fit_generator? Is it the same parameter as in fit?
For your first question, it is considered the same way as during training.
Basically, if you look at the function weighted_masked_objective, the individual samples are multiplied by their weights and the mean is returned. (Note: Keras does not set the class weights automatically; you need to pass the weights to model.fit() or model.fit_generator().)
Class weights can be computed as the inverse of each class's frequency, for example using sklearn.
fit_generator is identical to fit, except that it takes a generator as input.
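A rough sketch of both points together (the model, data and generator below are placeholders; in recent Keras versions model.fit also accepts generators directly):
import numpy as np

# Inverse-frequency class weights: n_samples / (n_classes * count_of_class_c)
counts = np.bincount(y_train)
class_weight = {c: len(y_train) / (len(counts) * n) for c, n in enumerate(counts)}

# A minimal batch generator
def batch_generator(X, y, batch_size=32):
    while True:
        idx = np.random.randint(0, len(X), batch_size)
        yield X[idx], y[idx]

# class_weight is passed to fit_generator exactly as it would be to fit
model.fit_generator(batch_generator(X_train, y_train),
                    steps_per_epoch=len(X_train) // 32,
                    epochs=10,
                    class_weight=class_weight)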
I have a class imbalance problem and want to solve it using cost-sensitive learning. The options I see are:
1) under-sample and over-sample
2) give weights to classes to use a modified loss function
Question
Scikit-learn has two options called class_weight and sample_weight. Is sample_weight actually doing option 2) and class_weight option 1)? Is option 2) the recommended way of handling class imbalance?
They are similar concepts, but with sample_weight you can force the estimator to pay more attention to some samples, and with class_weight you can force the estimator to learn with more attention to some particular class. sample_weight=0 or class_weight=0 basically means that the estimator doesn't need to take such samples/classes into consideration during the learning process at all. Thus a classifier (for example) will never predict some class if class_weight=0 for that class. If some sample_weight/class_weight is bigger than the sample_weight/class_weight of other samples/classes, the estimator will try to minimize the error on those samples/classes first. You can use user-defined sample_weights and class_weights simultaneously.
If you under-sample/over-sample your training set by simply removing/cloning instances, this is equivalent to decreasing/increasing the corresponding sample_weights/class_weights.
In more complex cases you can also try to generate samples artificially, with techniques like SMOTE.
sample_weight and class_weight serve a similar function: making your estimator pay more attention to some samples.
The effective sample weights will be sample_weight multiplied by the weights from class_weight.
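A quick way to check this in scikit-learn (a sketch with a tree model; folding the class weights into the sample weights by hand should give the same fit):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
sw = np.random.RandomState(0).rand(len(y))  # arbitrary per-sample weights
cw = {0: 1.0, 1: 5.0}                       # per-class weights

# Fit once with class_weight and sample_weight passed separately ...
a = DecisionTreeClassifier(random_state=0, class_weight=cw).fit(X, y, sample_weight=sw)
# ... and once with the product folded into sample_weight alone
b = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=sw * np.vectorize(cw.get)(y))

# The two trees should make identical predictions
print(np.array_equal(a.predict(X), b.predict(X)))  # expected: True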
This serves the same purpose as under/over-sampling, but the behavior is likely to be different: if you have an algorithm that randomly picks samples (as in random forests), it matters whether you oversampled or not.
To sum it up:
class_weight and sample_weight both do 2); option 2) is one way to handle class imbalance. I don't know of a universally recommended way; I would try 1), 2), and 1) + 2) on your specific problem to see what works best.