Imbalance in scikit-learn

Imbalance in scikit-learn - python

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in python in general? In Java there's the SMOTE mechanizm. Is there something parallel in python?

There is a new one here
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms in the following categories, including SMOTE
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.

In Scikit learn there are some imbalance correction techniques, which vary according with which learning algorithm are you using.
Some one of them, like Svm or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set on 'balanced', it will weight each class example proportionally to the inverse of its frequency.
Unfortunately, there isn't a preprocessor tool with this purpose.

I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Since others have listed links to the very popular imbalanced-learn library I'll give an overview about how to properly use it along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.SMOTE. For these libraries there is a nice parameter that allows the user to change the sampling ratio.
For example, in SMOTE, to change the ratio you would input a dictionary, and all values must be greater than or equal to the largest class (since SMOTE is an over-sampling technique). The reason I have found SMOTE to be a better fit for model performance is probably because with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the K-Nearest-Neighbors algorithm to make "similar" data points to those under sampled ones.
It is not good practice to blindly use SMOTE, setting the ratio to it's default (even class balance) because the model may overfit one or more of the minority classes (even though SMOTE is using nearest neighbors to make "similar" observations). In a similar way that you tune hyperparameters of a ML model you will tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or knn. Below is a working example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model out performed your other model(s). If you do not do this there will be data leakage and your model is essentially cheating.
from collections import Counter
from sklearn.preprocessing import MinMaxScaler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import numpy as np
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
sm = SMOTE(random_state=0, n_jobs=8, ratio={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)
### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
### Resample X_train_scaled
X_train_resampled, y_train_resampled = sm.fit_sample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))
### Train a model
xgbc_smote = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote,
eval_set = [(X_val_scaled, y_val)],
early_stopping_rounds=10)
### Evaluate the model
print('\ntrain\n')
print(accuracy_score(xgbc_smote.predict(np.array(X_train_scaled)), y_train))
print(f1_score(xgbc_smote.predict(np.array(X_train_scaled)), y_train))
print('\nval\n')
print(accuracy_score(xgbc_smote.predict(np.array(X_val_scaled)), y_val))
print(f1_score(xgbc_smote.predict(np.array(X_val_scaled)), y_val))

SMOTE is not a builtin in scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally
linked to, appears to be no longer available. The code is preserved here.
The newer answer below, by #nos, is also quite good.

Related

XGBoost for multiclassification and imbalanced data

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.
Observations:
I can handle the problem in the binary case: If I make the problem a binary classification by identifying classes [1,2], then I can get the classifier work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, works for binary classification. Is there an analogue of this parameter for the multi-classification problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tunning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques, such as over/undersampling, or SMOTE. But I want to avoid them as much as possible, and prefer a solutions using hyperparameter tunning of the model if possible.
My observation above shows that this can work for the binary case.

sample_weight parameter is useful for handling imbalanced data while using XGBoost for training the data. You can compute sample weights by using compute_sample_weight() of sklearn library.
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
class_weight='balanced',
y=train_df['class'] #provide your own target name
)
xgb_classifier.fit(X, y, sample_weight=sample_weights)

You can use sample_weight as #Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)

How to improve my multiclass text-classification on German text?

I am new in NLP and it is a bit confusing me.
I am trying to do a text classification with SVC on my dataset.
I have an imbalanced dataset of 6 classes.
The text is news for classes of health, sport, culture, economy, science and web.
I am using TF-IDF for vectorization.
the preprocessing steps: lower-case all the texts and to remove the stop-words. since my text is in German I did not use lemmatization
my first try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
X_train = train['text']
y_train = train['category']
X_test = test['text']
y_test = test['category']
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC()),])
predictions = text_clf_lsvc.predict(X_test)
my metrci acuuracy score was: 93%
then I decided to reduce the dimensionality: so on my 2nd try I added TruncatedSVD
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),('svd', TruncatedSVD()),
('clf', LinearSVC()),])
predictions = text_clf_lsvc.predict(X_test)
my metrci acuuracy score dropped to 34%.
my questions:
1- How can I improve my model if I want to stick to TF-IDF and SVC for classification
2- What can I do other than that if I want to have a good classification

The best way to improve accuracy, given that you want to stick with this configuration is through hyperparameter tuning, or by introducing additional components, such as feature selection.
Hyperparameter tuning
Most machine learning algorithms and parts of a machine learning pipeline have several parameters you can change. For example, the TfidfVectorizer has different ngram ranges, different analysis levels, different tokenizers, and many more parameters to vary. Most of these will affect your performance. So, what you can do is systematically vary these parameters (and those of your SVC), while monitoring you accuracy on a development set (i.e., not the test data!). Instead of fixed development set, cross-validation is typically used in these kinds of settings.
The best way to do this in sklearn is through a RandomizedSearchCV (see here for details). This class automatically cross-validates and searches through the possible options you pre-specify by randomly sampling from the option set for a fixed number of iterations. By applying this technique on your training data, you will automatically find models that perform better for your given training data and your options. Ideally, these models would also perform better on your test data. Fair warning: cross-validated search techniques can take a while to run.
Feature Selection
In addition to grid search, another way to improve performance is through feature selection. Feature selection typically consists of a statistical test that determines which features explain variance in the typical task you are trying to solve. The feature selection methods in sklearn are detailed here.
By far the most important bit here is that the performance of anything you add to your model should be verified on an independent development set, or in cross-validation. Leave your test data alone.

What exactly does `eli5.show_weights` display for a classification model?

I used eli5 to apply the permutation procedure for feature importance. In the documentation, there is some explanation and a small example but it is not clear.
I am using a sklearn SVC model for a classification problem.
My question is: Are these weights the change (decrease/increase) of the accuracy when the specific feature is shuffled OR is it the SVC weights of these features?
In this medium article, the author states that these values show the reduction in model performance by the reshuffle of that feature. But not sure if that's indeed the case.
Small example:
from sklearn import datasets
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC, SVR
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
clf = SVC(kernel='linear')
perms = PermutationImportance(clf, n_iter=1000, cv=10, scoring='accuracy').fit(X, y)
print(perms.feature_importances_)
print(perms.feature_importances_std_)
[0.38117333 0.16214 ]
[0.1349115 0.11182505]
eli5.show_weights(perms)

I did some deep research.
After going through the source code here is what I believe for the case where cv is used and is not prefit or None. I use a K-Folds scheme for my application. I also use a SVC model thus, score is the accuracy in this case.
By looking at the fit method of thePermutationImportance object, the _cv_scores_importances are computed (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L202). The specified cross-validation scheme is used and the base_scores, feature_importances are returned using the test data (function: _get_score_importances inside _cv_scores_importances).
By looking at get_score_importances function (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L55), we can see that base_score is the score on the non shuffled data and feature_importances (called differently there as: scores_decreases) are defined as non shuffled score - shuffled score (see https://github.com/TeamHG-Memex/eli5/blob/master/eli5/permutation_importance.py#L93)
Finally, the errors (feature_importances_std_) are the SD of the above feature_importances (https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/permutation_importance.py#L209) and the feature_importances_ is the mean of the above feature_importances (non-shuffled score minus (-) shuffled score).

A fair bit shorter answer to your original question, regardless of the setting for the cv parameter, eli5 will calculate the average decrease in the scorer you provide. Because you're using the sklearn wrapper, the scorer will come from scikit-learn: in your case accuracy. Overall as a word on the package, some of these details are particularly difficult to figure out without going into the deeper into the source code, might be worth trying to submit a pull request to make the documentation more detailed where possible.

Scikit learn fit estimator with predefined number of classes

So, I need to use some of the estimators in scikit-learn, namely LogisticRegression and SVM, but I have a problem, I have an extremely unbalanced dataset and need to run Kfold cross validation. The thing is sometimes the fold I am fitting can have only one target class of the available ones. I wanted to know if there's any way with these estimators to predefine the number of classes, maybe something like passing them a one-hot encoding representations of the target where it doesn't matter if all the examples are from one class, the shape of the target matrix will define the number of classes already.
Is there any way to do this with scikit-learn? Maybe with another library? I know those two algorithms use liblinear, maybe there's some interface I can use in that case.
Any way, thank you for your time.
EDIT: StratifiedFold cross validation is not useful for me because sometimes I have less amount of occurrences than the number of folds. E.g. it can happen that I have a dataset with 50 instances and 3 classes, but 46 can be of one class, 2 of a second class and 2 of a third class and though I can go for 3 fold cross validation I would generally need results of more folds than that, plus even with 3 folds still leaves open the case where one class is the only available for one fold.

The comment that said you need to gather more data may be right. However if you believe you have enough data for your model to learn something useful, you can over sample your minority classes (or possibly under sample the majority classes, but this sounds like a problem for over sampling). Having only one class in the data set makes it pretty much impossible for your model to learn anything about that class.
Here are some links to over sampling and under sampling libraries in python. The famous imbalanced-learn library is great.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Your case sounds like a good candidate for SMOTE. You also mentioned you wanted to change the ratio. There is a parameter in imblearn.over_sampling.SMOTE called ratio, where you would pass a dictionary. You can also do it with percentages (see the documentation).
SMOTE uses the K-Nearest-Neighbors algorithm to make "similar" data points to those under sampled ones. This is a more powerful algorithm than traditional over-sampling because then when your model gets the training data it helps avoid the issue where your model is memorizing key points of specific examples. Instead, smote creates a "similar" data point (likely in a multi-dimensional space) so your model can learn to generalize better.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (i.e. after you split), and then validate on the validation set and test sets to see if your SMOTE model out performed your other model(s). If you do not do this, there will be data leakage and you will get a model that doesn't even closely resemble what you want.
from collections import Counter
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import numpy as np
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
sm = SMOTE(random_state=0, n_jobs=8, ratio={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
X_resampled, y_resampled = sm.fit_sample(X_normalized, y)
print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_resampled))
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_resampled, y_resampled)
X_train_smote.shape, X_test_smote.shape, y_train_smote.shape, y_test_smote.shape, X_resampled.shape, y_resampled.shape
smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)
print('TRAIN')
print(accuracy_score(smote_xgbc.predict(np.array(X_train_normalized)), y_train))
print(f1_score(smote_xgbc.predict(np.array(X_train_normalized)), y_train))
print('TEST')
print(accuracy_score(smote_xgbc.predict(np.array(X_test_normalized)), y_test))
print(f1_score(smote_xgbc.predict(np.array(X_test_normalized)), y_test))

How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and that which I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set and then the hyperparameter alpha is tuned on the basis of results from cross validation of the validation set. Finally, the accepted model is used on the test set to give a realistic view oh how it will perform in reality. Seperating the concerns out here is a preventative measure against overfitting.
Actual Question
Does Lasso CV conform to the above protocol or does it just somehow train the model paramaters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.

If you use sklearn.cross_validation.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.cross_validation.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LassoCV
X = np.random.randn(20, 10)
y = np.random.randn(len(X))
cv_outer = KFold(len(X), n_folds=5)
lasso = LassoCV(cv=3) # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you, and you have to use it in conjunction with cross_val_score to obtain what you want. This is at the same time the reasonable way of implementing such objects, since we can also be interested in only fitting a hyperparameter optimized LassoCV without necessarily evaluating it directly on another set of held out data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.