Balancing an Unbalanced Dataset with K-Fold Cross Validation - python

I'm trying to train/validate a CNN using PyTorch on an unbalanced image dataset (class 1: 250 images, class 0: ~4,000 images). So far I've applied augmentation only to my training set (thanks #jodag), but my model is still learning to favor the class with significantly more images.
I want to find ways to compensate for my unbalanced dataset.
I thought about oversampling/undersampling with the imbalanced dataset sampler (https://github.com/ufoym/imbalanced-dataset-sampler), but I already use a sampler to select indices for my 5-fold cross-validation. Is there a way to implement cross-validation using the code below and also add this sampler? Similarly, is there a way to augment one class more frequently than the other? And along the same lines, are there alternative, simpler ways to address my unbalanced dataset that I haven't looked into yet?
Here's an example of what I have so far:
total_set = datasets.ImageFolder(PATH)
KF_splits = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    # Samplers that restrict each loader to this fold's indices (cross-validation split)
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    # Use a wrapper to apply augmentation only to the training set.
    # Both dataloaders pull images from the same folder but draw from the
    # training or validation indices respectively. Although the transforms
    # augment only the training set, they don't address the underlying
    # issue of a heavily unbalanced dataset.
    train_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=train_sampler)
    valid_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=valid_sampler)

    print("Fold: " + str(i))

    for epoch in range(epochs):
        # Train/validate model below
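For reference, one possible way to oversample the minority class inside each fold (a rough, untested sketch; it assumes total_set.targets exposes each image's class index, which torchvision's ImageFolder provides) is to swap the training fold's SubsetRandomSampler for a WeightedRandomSampler whose per-sample weights are computed from the class counts within train_idx only:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels = np.array(total_set.targets)        # class index of every image in the folder
train_labels = labels[train_idx]
class_counts = np.bincount(train_labels)    # counted on the training fold only
class_weights = 1.0 / class_counts          # rarer class -> larger weight

# Per-sample weights over the whole dataset; validation samples get weight 0,
# so the training sampler can never draw them.
sample_weights = np.zeros(len(labels))
sample_weights[train_idx] = class_weights[train_labels]

train_sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(train_idx),
    replacement=True)

The validation loader would keep its plain SubsetRandomSampler over valid_idx, so only the training batches are rebalanced.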
Thank you for your time and help!

Related

Different accuracy when splitting data with train_test_split than loading csv file afterwards

I have built a model to predict whether a customer is a business or a private customer. After training the model, I predict the class of 1,000 records that I did not use for training. The prediction is saved to a csv file.
Now I see two different behaviours:
Splitting sample data in the program
When I create the sample with train, sample = train_test_split(train, test_size=1000, random_state=seed), the prediction has the same accuracy as during training (the same value as the validation accuracy).
Splitting sample data in advance and then loading it
But when I split the data manually beforehand, taking 1,000 records from the original csv file and copying them into a new sample csv file that I load before doing the prediction after training, I get a much worse result (e.g. 76% instead of 90%).
This behaviour doesn't make sense to me, since the original data (the csv file used for training) was also shuffled in advance, so I should get the same result.
Here is the relevant code of the mentioned case distinction:
1. Splitting sample data in the program
Splitting
def getPreProcessedDatasetsWithSamples(filepath, batch_size):
    path = filepath
    data = __getPreprocessedDataFromPath(path)

    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.2, random_state=42)
    train, sample = train_test_split(train, test_size=1000, random_state=seed)

    train_ds = __df_to_dataset(train, shuffle=False, batch_size=batch_size)
    val_ds = __df_to_dataset(val, shuffle=False, batch_size=batch_size)
    test_ds = __df_to_dataset(test, shuffle=False, batch_size=batch_size)
    sample_ds = __df_to_dataset(sample, shuffle=False, batch_size=batch_size)

    return (train_ds, val_ds, test_ds, sample, sample_ds)
Prediction with sample, sample_ds
def savePredictionWithSampleToFileKeras(model, outputName, sample, sample_ds):
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)

    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 90%
2. Splitting sample data in advance and then loading it
Prediction by loading csv file
def savePredictionToFileKeras(model, sampleFilePath, outputName, batch_size):
    sample_ds = preprocessing.getPreProcessedSampleDataSets(sampleFilePath, batch_size)
    sample = preprocessing.getPreProcessedSampleDataFrames(sampleFilePath)

    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)

    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 77%
EDIT
Observation: When I load the whole dataset as the sample data, I get the same value as the validation accuracy, as expected (approx. 90%). But when I just randomize the line order of the same file, I get a value of 82%. To my understanding the accuracy should be the same, since the files contain exactly the same data.
Some additional information:
I have changed the implementation from the Sequential to the functional API. I'm using Embeddings in the pre-processing (I also tried one-hot encoding, without success).
Finally I found the problem: I am using a Tokenizer to preprocess the NAME and STREET columns, converting each word to a value that indicates how often the word occurs. When I use train_test_split, the word statistics are computed over all of the data, but when I load the sample dataset afterwards, I use only the words that occur in the sample dataset. For instance, the word "family" could be the most frequent word overall but only the third most frequent in the sample dataset, so the encoding would be completely wrong.
After using the same Tokenizer instance for all of the data, I get the same high accuracy everywhere.
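For illustration, a minimal sketch of that fix (assuming Keras' Tokenizer, and that data, train and sample are the DataFrames from the splitting code above; the NAME/STREET columns are the ones mentioned in the post):

from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the tokenizer once, on the full corpus of NAME/STREET text, so that
# word indices are identical no matter which subset is encoded later.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['NAME'].astype(str) + ' ' + data['STREET'].astype(str))

# Reuse the same fitted instance everywhere; never call fit_on_texts again
# on a subset or on a sample csv loaded later.
train_name_seq = tokenizer.texts_to_sequences(train['NAME'].astype(str))
sample_name_seq = tokenizer.texts_to_sequences(sample['NAME'].astype(str))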
Neither of the above methods gives a reliable accuracy. Accuracy is only a good measurement if your data is balanced; for unbalanced data it is not a good measure and will not be consistent: each time you change the split, the accuracy will change.
You should first use k-fold cross-validation so that all data points are used for training the model. If your dataset is not balanced, you can try different balancing techniques such as oversampling or undersampling on the training data, and then validate the model.
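A sketch of that recipe (X and y stand in for the poster's preprocessed features and labels as NumPy arrays; the oversampling step uses the imbalanced-learn package and a generic classifier as placeholders):

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Oversample only the training fold; leave the validation fold untouched.
    X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X[train_idx], y[train_idx])

    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    # A balance-aware metric is more informative than plain accuracy here.
    print(balanced_accuracy_score(y[val_idx], clf.predict(X[val_idx])))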

How to divide the dataset when it is distributed

Now I want to divide a dataset into two parts: the train set and validation set. I know that on a single GPU I can do this using a sampler:
indices = list(range(len(train_data)))
train_loader = torch.utils.data.DataLoader(
    train_data, batch_size=args.batch_size,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
    pin_memory=True, num_workers=2)
But when I want to train it in a parallel way using torch.distributed, I have to use another sampler, namely, sampler = torch.utils.data.distributed.DistributedSampler(train_data)
So how should I use the two samplers together, so that I can divide the dataset and distribute it at the same time?
Thank you very much for any help!
You can split torch.utils.data.Dataset before creating torch.utils.data.DataLoader.
Simply use torch.utils.data.random_split like this:
train, validation = torch.utils.data.random_split(
    dataset,
    (len(dataset) - val_length, val_length))
This would give you two separate datasets which could be used with dataloaders however you wish.
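For completeness, a sketch of how the split subsets could then be combined with DistributedSampler (it assumes the process group has already been initialized with torch.distributed.init_process_group; val_length, the batch size, and epochs are placeholders):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

val_length = int(0.2 * len(dataset))                 # e.g. 20% held out for validation
train_set, val_set = torch.utils.data.random_split(
    dataset, (len(dataset) - val_length, val_length))

# Each process draws a distinct shard of the training subset.
train_sampler = DistributedSampler(train_set)
train_loader = DataLoader(train_set, batch_size=32, sampler=train_sampler,
                          pin_memory=True, num_workers=2)

# Validation can stay un-sharded (or get its own DistributedSampler with shuffle=False).
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

for epoch in range(epochs):
    train_sampler.set_epoch(epoch)   # reshuffle the shards each epoch
    # training / validation loop here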

Train and validation score is high but very Poor Test Accuracy

I am working on multi-label image classification, and I am using Inception Net as my base architecture.
After the complete training I am getting training accuracy > 90% and validation accuracy > 85%, but only 17% accuracy on the test data.
Model training:
model = Model(pre_trained_model.input, x)
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.0001),  # 'adam'
              metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=600,  # total data / batch size
    epochs=100,
    validation_data=validation_generator,
    validation_steps=20,
    verbose=1,
    callbacks=callbacks)
Testing on the trained model:
test_generator = test_datagen.flow_from_directory(
    test_dir, target_size=(128, 128), batch_size=1, class_mode='categorical')

filenames = test_generator.filenames
nb_samples = len(filenames)

prediction = test_model.predict_generator(test_generator, steps=nb_samples, verbose=1)
Saving the results to Pandas
predicted_class_indices = np.argmax(prediction, axis=1)

labels = (train_generator.class_indices)  # getting names of classes from the folder structure
labels = dict((v, k) for k, v in labels.items())
predictions = [k for k in predicted_class_indices]

results = pd.DataFrame({"image_name": filenames,
                        "label": predictions})
results['image_name'] = [each.split("\\")[-1] for each in results['image_name']]
Everything looks fine, but I am still getting very poor predictions.
Kindly help me figure out where I am making a mistake.
It can be the case that the images in your dataset are arranged in such a way that the test images are previously unseen by the model, so the accuracy drops significantly.
What I recommend is that you try k-fold cross-validation, or even stratified k-fold cross-validation. The benefit is that your dataset will be split into, say, 10 folds. In every iteration (out of 10), one fold is the test fold and all the others are train folds. In the next iteration, the test fold from the previous step becomes a train fold and some other fold becomes the test fold. It's important to note that every fold is the test fold exactly once. Another benefit of stratified k-fold is that it takes the class labels into account and tries to split the data so that every fold has approximately the same class distribution.
Another way to achieve somewhat better results is to simply shuffle the images before picking the training and test sets.
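A sketch of stratified folds over an image list (assuming the filenames and labels have first been collected into a pandas DataFrame df with 'filename' and 'class' columns; flow_from_dataframe is used here instead of flow_from_directory):

from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.preprocessing.image import ImageDataGenerator

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
datagen = ImageDataGenerator(rescale=1. / 255)

for fold, (train_idx, test_idx) in enumerate(skf.split(df['filename'], df['class'])):
    train_gen = datagen.flow_from_dataframe(df.iloc[train_idx], x_col='filename',
                                            y_col='class', target_size=(128, 128),
                                            class_mode='categorical', batch_size=32)
    test_gen = datagen.flow_from_dataframe(df.iloc[test_idx], x_col='filename',
                                           y_col='class', target_size=(128, 128),
                                           class_mode='categorical', batch_size=32,
                                           shuffle=False)
    # build a fresh model here, fit on train_gen, then evaluate on test_gen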

Using Scikit-Learn GridSearchCV for cross validation with PredefinedSplit - Suspiciously good cross validation results

I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).
I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Where I think I'm going wrong: I'm splitting up my training data into development and validation sets, training on the development set and recording the cross validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets, then testing on the validation set. I'm not sure if I'm using scikit-learn's PredefinedSplit module correctly. Details below:
Following this answer, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV

# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data[training_features], data[training_response], test_size=0.2, random_state=550)

# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
This is a 323430-dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0. '''
validation_idx = np.repeat(-1, y_train.shape)
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
                                int(round(.2 * validation_idx.shape[0])), replace=False)] = 0

# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())

# sanity check
print(len(validation_split[0][0]))  # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0]))  # outputs .8
print(validation_idx.shape[0] == y_train.shape[0])  # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1])))  # set([])
Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.
from xgboost import XGBClassifier

# a vanilla XGBoost model
model1 = XGBClassifier()

# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# A grid search.
# NOTE: I'm passing the split produced by PredefinedSplit as the argument to the `cv` parameter.
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           verbose=1)
Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse is that it's almost identical to the accuracy I get if I apply the classifier to the development set (on which I just trained): 0.89295597192591902. BUT when I use the classifier on the true test set, I get a much lower accuracy, roughly .78:
# accuracy score on the validation set. This yields .89207865
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train[validation_split[0][1]])

# accuracy score when applied to the development set. This yields .8929559
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train[validation_split[0][0]])

# finally, the score when applied to the test set. This yields .783
accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
To me, the almost exact correspondence between the model's accuracy when applied to the development and validation datasets, and the significant loss in accuracy when applied to the test set is a clear sign that I'm training on the validation data by accident, and thus my cross validation score is not representative of the true accuracy of the model.
I can't seem to find where I went wrong - mostly because I don't know what GridSearchCV is doing under the hood when it receives a PredefinedSplit object as the argument to the cv parameter.
Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.
Thanks!
You need to set refit=False (it is not the default option); otherwise the grid search will refit the estimator on the whole dataset (ignoring cv) after the grid search completes.
Yes, there was a data leakage problem for the validation data. You need to set refit=False for GridSearchCV, and it will not refit on the whole data (training plus validation).
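A sketch of the suggested change, reusing the poster's variable names (grid_result here plays the role of the poster's grid_result2; the manual refit at the end is one reasonable way to get a final model trained on the development indices only):

grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           refit=False,   # do NOT refit on dev + validation after the search
                           verbose=1)
grid_result = grid_search.fit(X_train, y_train)

# With refit=False there is no best_estimator_, so refit manually on the
# development indices only, using the best parameters that were found.
best_model = XGBClassifier(**grid_result.best_params_)
best_model.fit(X_train.iloc[validation_split[0][0]],
               y_train[validation_split[0][0]])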

How do I handle unbalanced classes in my classifier?

I am using a linear SVM (LinearSVC) to classify my documents into categories. However, my dataset is unbalanced, with some categories containing 48,000 documents and others as few as 100. When I train my model, even using StratifiedKFold, I see that the category with 48,000 documents gets a larger portion of the documents (3,300) compared to the others. In such a case it would definitely give me biased predictions. How can I make sure this selection isn't biased?
kf = StratifiedKFold(labels, n_folds=10, shuffle=True)
for train_index, test_index in kf:
    X_train, X_test = docs[train_index], docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
Then I'm writing these(X_train, Y_train) to a file, computing the feature matrix and passing them to the classifier as follows:
model1 = LinearSVC()
model1 = model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)
print("Accuracy is:")
print(metrics.accuracy_score(label_test, pred))
print(metrics.classification_report(label_test, pred))
The StratifiedKFold method by default takes the label ratios into account, meaning that each fold will have the exact (or close to exact) proportion of each label as the full dataset. Whether you want to adjust for this is somewhat up to you: you can either let the classifier learn some bias towards labels with more samples (as you do now), or you can do one of two things:
Construct a separate train/test set in which the training set has an equal number of samples for each label (in your case, each class label in the training set might then only have 50 examples, which is not ideal). You can then train on this training set and test on the rest. If you do this multiple times with different samples, you are essentially doing k-fold cross-validation, just choosing your sample sizes in a different way.
You can change your loss function (i.e. the way you initialize LinearSVC()) to account for the class imbalance. For example: model = LinearSVC(class_weight='balanced'). This causes the model to learn a loss function that takes the class imbalance into account; a short sketch follows below.
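A sketch of option 2 combined with stratified folds, written against the current scikit-learn API (where StratifiedKFold takes n_splits and is split on the data; docs is assumed to be an already-vectorized feature matrix):

from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(docs, labels):
    X_train, X_test = docs[train_index], docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]

    # class_weight='balanced' reweights errors inversely to class frequency,
    # so the 100-document categories count as much as the 48,000-document one.
    model = LinearSVC(class_weight='balanced')
    model.fit(X_train, Y_train)

    # Per-class precision/recall/F1 is more informative than overall accuracy.
    print(classification_report(Y_test, model.predict(X_test)))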
