I am working on multi-label image classification. This is my data frame:
[UPDATED]
As you can see, the images are labeled with 26 attributes: "1" means the attribute is present, "0" means it is absent.
My problem is that many of the labels are imbalanced. For example:
[1] train_df.value_counts('Eyeglasses')
Output:
Eyeglasses
0 54735
1 1265
dtype: int64
[2] train_df.value_counts('Double_Chin')
Output:
Double_Chin
0 55464
1 536
dtype: int64
How can I split it into training and validation sets so that both are balanced?
[UPDATE]
I tried the following:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
but it raised:
ValueError: Imbalanced-learn currently supports binary, multiclass and
binarized encoded multiclasss targets. Multilabel and multioutput
targets are not supported.
Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets which have proportional representation, and resampling methods to deal with class imbalance. I am going to focus on just the splitting part of the problem, since that's what the title is about.
I would use a stratified shuffle split to make sure that each subset has equal representation.
For this I recommend skmultilearn's IterativeStratification method. It supports multi-label datasets.
from skmultilearn.model_selection import IterativeStratification

train_fraction = 0.8  # assumed train share; set this to your own split ratio

stratifier = IterativeStratification(
    n_splits=2, order=2,
    sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
)

# This class is a generator that produces k folds; we iterate it once to make
# a single static split. NOTE: it needs to be computed on hard (0/1) labels.
train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))

# img_urls array, shape (N_samp,)
x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
# labels array, shape (N_samp, n_classes)
Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
I wrote a more complete solution, including unit tests, in a blog post.
One downside of skmultilearn is that it is not very well maintained and has some broken functionality. I documented a few of these sharp corners and gotchas in my blog post. Also note that this stratification procedure is painfully slow once you get to several million images, because the stratifier only uses a single CPU.
Hi everyone.
I am doing binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I do the oversampling with SMOTE, new target values are created in the y vector (0, 1, 2 for example). I do not know how to avoid that. I would like to know how to limit the oversampled values to 0 and 1.
This is what I am doing:
from imblearn.over_sampling import SMOTE

over = SMOTE(sampling_strategy='minority')
X, y = over.fit_resample(X, y)
y's vector type is int64.
Also, I am modeling with an ANN using Keras. The dataset available to me is split into three different datasets: train, val and test, for training, validation and testing purposes. I wonder which one should be OVERsampled (or UNDERsampled), or whether I should OVERsample (or UNDERsample) BEFORE splitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15)
WARNING: my model will be evaluated later on using a much larger dataset (1000K records) which is the REAL test dataset from the company I am working for, and whose target values are UNKNOWN for me and my model.
Thanx
Johnny
Since you have 190 columns and 500K records, SMOTE will not work well: SMOTE is essentially an interpolation technique, so it may suffer from the curse of dimensionality. Oversampling and undersampling approaches also have their limitations. I would prefer to use class weights. Also, make sure that all data pre-processing is performed separately on the test data and the train data. Here is a paper outlining the framework: https://www.sciencedirect.com/science/article/pii/S2666827022000585
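A minimal sketch of the class-weight approach, assuming a compiled Keras model named model and 0/1 integer targets in y_train (names taken from your snippets):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are inversely proportional to the class frequencies
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # e.g. {0: 0.51, 1: 25.4}

# Keras scales each sample's loss by the weight of its class, so the minority
# class is not drowned out; no resampling (and no new target values) involved.
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=10, class_weight=class_weight)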
My dataset consists of 10 features (10 columns) as input, with the last 3 columns as 3 different outputs. If I use one column for the output, for example y = newDf.iloc[:, 10].values, it works; but if I use all 3 columns it gives me an error at pipe_lr.fit saying: y should be a 1d array, got an array of shape (852, 3) instead.
How can I pass y?
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = newDf.iloc[:, 0:10].values
y = newDf.iloc[:, 10:13].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
The pipeline itself does not care about the format of y; it just hands it over to each step. In your case, the step that fails is the LogisticRegression, which indeed is not set up for multi-label classification. You can manage it using the MultiOutputClassifier wrapper:
from sklearn.multioutput import MultiOutputClassifier

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    MultiOutputClassifier(LogisticRegression(random_state=1, solver='lbfgs'))
)
(There is also a MultiOutputRegressor, and more-complicated things like ClassifierChain and RegressorChain. See the User Guide. However, there is not to my knowledge a builtin way to mix and match regression and classification tasks.)
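With that wrapper, fitting and predicting work as before (a sketch using the question's split):
pipe_lr.fit(X_train, y_train)     # y_train has shape (n_samples, 3)
y_pred = pipe_lr.predict(X_test)  # predictions come back as (n_samples, 3)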
Simply put, no.
What you want is called multi-label learning, which is not supported by Scikit-Learn.
You should train three models, one for each label.
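A minimal sketch of that one-model-per-label idea, assuming the question's X_train/y_train split (y_train of shape (n_samples, 3)):
import numpy as np
from sklearn.linear_model import LogisticRegression

# One independent binary classifier per output column
models = []
for i in range(y_train.shape[1]):
    clf = LogisticRegression(random_state=1, solver='lbfgs')
    clf.fit(X_train, y_train[:, i])
    models.append(clf)

# Stack the per-label predictions back into shape (n_samples, 3)
y_pred = np.column_stack([m.predict(X_test) for m in models])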
I use StratifiedShuffleSplit from sklearn.model_selection to split my dataset, but it only works for one label. How can I do a stratified split on both labels, and add a threshold for the number of samples in each class of each label?
Sorry for my bad English.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=val_size, random_state=42)
for train_index, val_index in split.split(tdf, tdf['layout']):
    train_df = tdf.loc[train_index]
    val_df = tdf.loc[val_index]
Could you elaborate more on "in each class of each label", please?
This API can be used to ensure that both the train and test sets have the same proportion of examples in each class. In other words, we use it to make sure that we have samples from each class in both the train and test sets. It does not handle the problem of imbalanced classes, which is a separate problem.
If you mean 1 class per 1 label, then you need to check the number of classes before splitting:
tdf['layout'].nunique()
The output should be: 2
Also, in your code you are only splitting the independent features (input) and not the output/target (in your case: tdf['layout']).
Add this line of code:
y_train, y_val = tdf['layout'][train_index], tdf['layout'][val_index]
Also, make sure that the 'layout' column is then dropped from tdf.
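If you need to stratify on two labels at once, one common workaround (a sketch; 'label2' is a hypothetical name for your second label column) is to stratify on a key combining both columns. Note that every combination has to occur at least twice for the split to work:
combined = tdf['layout'].astype(str) + '_' + tdf['label2'].astype(str)

split = StratifiedShuffleSplit(n_splits=1, test_size=val_size, random_state=42)
for train_index, val_index in split.split(tdf, combined):
    train_df = tdf.loc[train_index]
    val_df = tdf.loc[val_index]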
If I run a simple decision tree regression model using data via the train_test_split function, I get nice r2 scores and low MSE values.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

training_data = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
y = training_data.iloc[:, 0]
x = training_data.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

regressor = DecisionTreeRegressor(random_state=0)
# fit the regressor with the training data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Yet if I split the data file manually into two files, 2/3 train and 1/3 test (there is a column called human which takes a value from 1 to 9 indicating which human it is; I use humans 1-6 for training and 7-9 for testing), I get negative r2 scores and high MSE.
training_data = pandas.read_csv("train"+".csv",usecols=['y','x1','x2','x3'])
testing_data = pandas.read_csv("test"+".csv", usecols=['y','x1','x2','x3'])
y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))]
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I was expecting more or less the same results, and all the data types seem the same for both calls.
What am I missing?
I'm assuming that both methods here actually do what you intend and that the shapes of your X_train/test and y_train/test are the same coming from both methods. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).
Plot the distributions (i.e. make bar charts/density plots) of the labels (y) in the initial train/test sets vs. in the second ones (from the manual implementation). You can dive deeper and also plot your other columns, to see if anything about the distributions of your data differs between the resulting sets of the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, it could be that your labels (or other columns) are actually sorted in your manual implementation, so you get very different distributions in the datasets you're comparing.
Also, if you want to make sure that your manual split results in a 'representative' set (one that would generalise well) based on model results rather than underlying data distributions, I would compare it against the results of a cross-validated model, not a single set of results.
Essentially, although the probability is small and train_test_split does some shuffling, you could get a train/test pair that performs well just out of luck. (To reduce the chance of that without doing cross-validation, I'd suggest making use of the stratify argument of the train_test_split function; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
If you do cross-validate (rather than relying on a single train_test_split), you get an average model prediction for the folds and a confidence interval around it, and you can check whether your second model's results fall within that interval. If they don't, it again just means your split is 'corrupted' somehow (e.g. by having sorted values).
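A minimal sketch of that comparison, assuming the x and y from the first snippet:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# One r2 score per fold; check whether your manual split's score
# falls within the spread of these fold scores
scores = cross_val_score(DecisionTreeRegressor(random_state=0), x, y,
                         cv=5, scoring='r2')
print(scores.mean(), scores.std())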
P.S. I'd also add that decision trees are models that are known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results due to bootstrapping/bagging, which acts similarly to cross-validation in reducing the chance of overfitting.)
1 - http://cv.znu.ac.ir/afsharchim/AI/lectures/Decision%20Trees%203.pdf
The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit as per the documentation, which means this method randomizes your data when splitting.
When you split manually, you didn't randomize, so if your labels are not spread evenly throughout your dataset, you'll of course have performance issues, since your model won't generalize well when the training data does not contain enough samples of the other labels.
If my suspicion is correct, you should get similar result by passing shuffle=False into train_test_split.
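For example (a sketch using the question's x and y):
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33,
                                                    shuffle=False)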
Suppose your dataset contains this data:
1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0
So suppose you want a 50% train split. train_test_split shuffles it like this, so it generalizes better:
1 + 1 = 2
2 - 2 = 0
So it knows what to do when it sees this data:
2 + 2
4 - 4  # since it learned both addition and subtraction
But when you split it manually like this:
1 + 1 = 2
2 + 2 = 4  # only learned addition
It doesn't know what to do when it sees this data:
2 - 2
4 - 4  # test data is subtraction
Hope this answers your question.
It may sound like a simple check, but...
In the first example you are reading data from 'data.csv'; in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3 mark and saved the first part as 'train.csv' and the remainder as 'test.csv', then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why train_test_split randomizes the rows. If you haven't already done it, try randomizing the rows first and then writing them to your train and test csv files, to ensure you have a homogeneous dataset.
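A minimal sketch of that shuffle-then-split step (file names taken from your snippets):
import pandas

df = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle the rows

cut = int(len(df) * 2 / 3)  # 2/3 train, 1/3 test
df.iloc[:cut].to_csv('train.csv', index=False)
df.iloc[cut:].to_csv('test.csv', index=False)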
The other line that might be out of place is line 6:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
Perhaps l_vars contains something other than what you expect. Maybe it should read the following, to be consistent with the training data:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))]
Good luck, and let us know if this helps.
I have a text classification task based on documents, where I expect the classes are related to word frequencies. Because of the specific nature of my application, where I have a corpus that will grow over time and want to classify new documents as they arrive, I have used FeatureHasher rather than the existing TfidfVectorizer (which both vectorizes and does the adjustment), since the vocabulary size can grow with new documents.
As discussed here, for instance (https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest), it seems correct to me that the term frequencies when doing TFIDF should be calculated relative to the train set only and then used to rescale the test set, rather than first rescaling the entire corpus and then splitting. This is because using the test dataset for the frequency calculations violates the principle that you shouldn't use that information.
Let's assume you start with a matrix X of raw term frequencies (not adjusted yet) and y, a vector of classes. The typical order that many code examples show is:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

vec = TfidfTransformer()
# Rescale X by its own (whole-corpus) frequencies, then split
X = vec.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#...now fit a model
but the correct thing should be the following:
vec = TfidfTransformer()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Store the rescaling based on X_train frequencies alone
vec.fit(X_train)
# Rescale each (transform) using the same fitted model
X_train = vec.transform(X_train)
X_test = vec.transform(X_test)
#...now fit a model
Okay, now the main question: I want to conduct some kind of cross-validation, perhaps with GridSearchCV, where I can feed it a set of potential model parameters and conduct several splits of the data for each one. The typical way to do this is to build a model pipeline and then feed it into the cross-validation utility. Since pipelines are black boxes whose details are hard to inspect, I just want to verify that, if TfidfTransformer is included as a step in the pipeline, it does the adjustment correctly, as described above, by fitting on the training data of each split.
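For reference, a sketch of the setup being asked about (the parameter grid is hypothetical; the classifier is just a placeholder). GridSearchCV clones and refits the whole pipeline on the training portion of each split, so the TfidfTransformer's IDF statistics are learned from that fold's training data only and merely applied to the held-out fold:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('tfidf', TfidfTransformer()),     # fit (IDF) happens on each training fold only
    ('clf', RandomForestClassifier()),
])
param_grid = {'clf__n_estimators': [100, 300]}  # hypothetical grid
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # X: raw term-frequency matrix, y: class vector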