Hi everyone.
I am doing binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I oversample with SMOTE, new target values appear in the y vector (0, 1, 2, for example). I do not know how to avoid that. How can I limit the oversampled target values to 0 and 1?
This is what I am doing:
from imblearn.over_sampling import SMOTE

over = SMOTE(sampling_strategy='minority')
X, y = over.fit_resample(X, y)
y's vector type is int64.
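(As a quick sanity check, a minimal sketch assuming y is a NumPy array or pandas Series: SMOTE with sampling_strategy='minority' only interpolates new minority-class samples, so listing the distinct target values before resampling should show whether the extra label was already present.)

import numpy as np

# distinct target values and their counts before resampling;
# any value besides 0 and 1 here was in y before SMOTE ran
print(np.unique(y, return_counts=True))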
Also, I am modeling with an ANN using Keras. The data available to me is split into three datasets: train, val and test, for training, validation and testing purposes. I wonder which one should be OVERsampled (or UNDERsampled), or whether I should OVERsample (or UNDERsample) BEFORE splitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15)
WARNING: my model will be evaluated later on a much larger dataset (1000K records), which is the REAL test dataset from the company I am working for, and whose target values are UNKNOWN to me and my model.
Thanks
Johnny
Since you have 190 columns and 500K records, SMOTE will not work well: SMOTE is essentially an interpolation technique, and it can suffer from the curse of dimensionality. OVERsampling and UNDERsampling approaches also have limitations. I would prefer to use class weights. Also, make sure that all data pre-processing is performed separately on the train and test data. Here is a paper that outlines the framework: https://www.sciencedirect.com/science/article/pii/S2666827022000585
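As an illustration of the class-weight route, here is a minimal sketch; it assumes model is your compiled Keras ANN and that X_train, y_train, X_val, y_val come from your existing splits:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# weight each class inversely to its frequency, computed on the TRAINING split only
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Keras scales each sample's contribution to the loss by its class weight,
# so no resampling of the data is needed
model.fit(X_train, y_train, validation_data=(X_val, y_val), class_weight=class_weight)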
I have a dataset of drugs, associated chemical features, and whether they are "responsive" or "unresponsive". I need to ensure that once I split the dataset into test and train, both have the same responsive:unresponsive proportion. I know how to randomly split the data into 80% training and 20% test. I am not sure how to do the stratified sampling necessary here; is this what I'm meant to use: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?
The train_test_split function already has a parameter that lets you keep the class proportions of y. The parameter is stratify, defined in the documentation as "If not None, data is split in a stratified fashion, using this as the class labels".
An example of code would be:
from sklearn.model_selection import train_test_split

# stratify=y keeps the responsive:unresponsive proportion the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
I am working on multi-label image classification. This is my data frame:
[UPDATED]
As you can see, the images are labeled with 26 features: "1" means the feature is present, "0" means it is absent.
My problem is that many of the labels are imbalanced. For example:
[1] train_df.value_counts('Eyeglasses')
Output:
Eyeglasses
0 54735
1 1265
dtype: int64
[2] train_df.value_counts('Double_Chin')
Output:
Double_Chin
0 55464
1 536
dtype: int64
How can I split it into training and validation data so that both are balanced?
[UPDATE]
I tried:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
X_train_smote, y_train_smote = smote.fit_sample(X_train, y_train)
ValueError: Imbalanced-learn currently supports binary, multiclass and
binarized encoded multiclasss targets. Multilabel and multioutput
targets are not supported.
Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets which have proportional representation, and resampling methods to deal with class imbalance. I am going to focus on just the splitting part of the problem, since that's what the title is about.
I would use a stratified shuffle split to make sure that each subset has equal representation (Wikipedia has a handy visual explaining stratified sampling).
For this I recommend skmultilearn's IterativeStratification, which supports multi-label datasets.
from skmultilearn.model_selection import IterativeStratification

train_fraction = 0.8  # desired fraction of samples in the training split

stratifier = IterativeStratification(
    n_splits=2, order=2,
    sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
)

# this class is a generator that produces k folds; we just want to iterate it
# once to make a single static split
# NOTE: needs to be computed on hard labels
train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))

# img_urls: array of shape (N_samp,)
x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
# labels: array of shape (N_samp, n_classes)
Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
I wrote a more complete solution, including unit tests, in a blog post.
One downside of skmultilearn is that it is not very well maintained and has some broken functionality; I documented a few of these sharp corners and gotchas in my blog post. Also note that this stratification procedure becomes painfully slow once you get to several million images, because the stratifier only uses a single CPU.
I have a text classification task based on documents, where I expect the classes to be related to word frequencies. Because of the specific nature of my application, where the corpus will grow over time and I want to classify new documents as they arrive, I have used FeatureHasher rather than TfidfVectorizer (which both vectorizes and does the TF-IDF adjustment), since the vocabulary can grow with new documents.
As discussed here, for instance (https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest), it seems correct to me that the term frequencies used for the TF-IDF adjustment should be calculated on the train set only and then used to rescale the test set, rather than rescaling the entire corpus first and then splitting. This is because using the test set for the frequency calculations violates the principle that test information should not leak into training.
Let's assume you start with a matrix X of raw term frequencies (not adjusted yet) and y, a vector of classes. The typical order that many code examples show is:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

vec = TfidfTransformer()
# rescale X by its own frequencies, then split
X = vec.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# ...now fit a model
but the correct thing should be the following:
vec = TfidfTransformer()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# store rescaling based on X_train frequencies alone
vec.fit(X_train)
# rescale each set (transform) with the same fitted model
X_train = vec.transform(X_train)
X_test = vec.transform(X_test)
# ...now fit a model
Okay, now the main question: I want to conduct some kind of cross-validation, perhaps with GridSearchCV, where I can feed it a set of potential model parameters and have it conduct several splits of the data for each one. The typical way to do this is to build a model pipeline and feed it into the cross-validation utility. Since pipelines are something of a black box whose internals are hard to inspect, I just want to verify that, if TfidfTransformer is included as a step in the pipeline, it does the adjustment correctly as described above, fitting on the training data of each split only.
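To make the setup concrete, here is a minimal sketch of the kind of pipeline I mean (the RandomForestClassifier and the grid values are just placeholders):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV

# chaining the transformer and the classifier lets cross-validation
# refit both on each training fold
pipe = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# hypothetical grid; parameter names are prefixed by the pipeline step name
param_grid = {'clf__n_estimators': [100, 300]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)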
I've tried out Linear Regression using SKLearn. I have data something along the lines of: Calories Eaten | Weight.
150 | 150
300 | 190
350 | 200
These are basically made-up numbers, but I've fit the dataset to the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)  # ??
But how would I go about making only my 10 new data numbers of Calories Eaten and make it the Test Set I want the regressor to predict?
You are correct: you simply call the predict method of your model and pass in the new, unseen data. Now it also depends on what you mean by new data: are you referring to data whose outcome you do not know (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
If you know the respective weight values and want to evaluate the model:
Make sure that you have two separate datasets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable, then calculate performance using one of a number of metrics. The most common one is the root mean squared error (RMSE), to which you simply pass y_test and y_pred as parameters. Here is a list of all the regression performance metrics supplied by sklearn.
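For instance, a minimal sketch (assuming y_test and y_pred from above):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE: the square root of the mean squared residual, in the units of the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)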
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split
# random_state can be any number (to ensure the same split each run); test_size=0.25 holds out a 25% cut
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size=0.25, random_state=42)
Train the model by fitting x_train and y_train. Then evaluate its performance by predicting on x_test and comparing those predictions with the actual results in y_test. This way you get an idea of how the model performs. Furthermore, you can then predict the weight values for the 10 new data points accordingly.
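Predicting the 10 new points would then look something like this sketch (new_calories is a hypothetical stand-in for your 10 values):

import numpy as np

# sklearn expects a 2D array of shape (n_samples, n_features),
# hence the reshape for a single feature
new_calories = np.array([120, 250, 310, 400, 520, 610, 700, 820, 910, 1000]).reshape(-1, 1)
predicted_weights = regressor.predict(new_calories)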
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
You have to split the data using sklearn's model_selection, then train the model and predict:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(eaten, weight)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
Yes, Calories Eaten is the independent variable, while Weight is the dependent variable.
After you split the data into a training set and a test set, the next step is to fit the regressor using the X_train and y_train data.
After the model is trained, you can predict the results for X_test, which gives you y_pred.
Now you can compare y_pred (the predicted data) with y_test, which is the real data.
You can also use the score method of your linear model to get its performance.
The score is calculated with the R^2 (R squared) metric, also known as the coefficient of determination.
score = regressor.score(x_test, y_test)
For splitting the data you can use train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size=0.2, random_state=0)
I have training data with more than 20,000 instances, split into 3 classes with a distribution of roughly A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while respecting the distribution of the classes? I need to do a grid search on the data, but the original dataset is too high-dimensional (a 20,000 x 12,000 feature matrix).
train_test_split (with stratify) will keep the distribution, but it only splits the entire dataset into two sets, which are still too large.
Thanks
You should use StratifiedKFold. The folds are made by preserving the percentage of samples for each class.
See the documentation for using it.
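A minimal sketch of that idea, assuming X and y are NumPy arrays: with n_splits=10, each test fold is a stratified 10% sample of the data.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# take the first fold's test indices as a 10% sample that preserves
# the class distribution
_, sample_idx = next(skf.split(X, y))
X_sample, y_sample = X[sample_idx], y[sample_idx]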
The train_test_split function also lets you define the size of the split; pass stratify=y to keep the class distribution:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
See the docs