How to divide the dataset when it is distributed - python

I want to divide a dataset into two parts, a training set and a validation set. I know that on a single GPU I can do this using a sampler:
indices = list(range(len(train_data)))
train_loader = torch.utils.data.DataLoader(
    train_data, batch_size=args.batch_size,
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
    pin_memory=True, num_workers=2)
But when I want to train in parallel using torch.distributed, I have to use another sampler, namely sampler = torch.utils.data.distributed.DistributedSampler(train_data).
How should I use the two samplers together, so that I can divide the dataset and distribute it at the same time?
Thank you very much for any help!

You can split torch.utils.data.Dataset before creating torch.utils.data.DataLoader.
Simply use torch.utils.data.random_split like this:
train, validation = torch.utils.data.random_split(
    dataset,
    (len(dataset) - val_length, val_length)
)
This would give you two separate datasets which could be used with dataloaders however you wish.
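To tie this back to the distributed part of the question: each split returned by random_split is an ordinary Dataset, so the training split can be wrapped in a DistributedSampler. A minimal sketch, assuming torch.distributed.init_process_group has already been called and that dataset, val_length, and args are defined as above:
import torch

# Split first, then distribute only the training split.
train, validation = torch.utils.data.random_split(
    dataset, (len(dataset) - val_length, val_length))

train_sampler = torch.utils.data.distributed.DistributedSampler(train)
train_loader = torch.utils.data.DataLoader(
    train, batch_size=args.batch_size, sampler=train_sampler,
    pin_memory=True, num_workers=2)

# Validation can stay local to each rank; give it its own
# DistributedSampler(validation) if you want to shard it too.
val_loader = torch.utils.data.DataLoader(
    validation, batch_size=args.batch_size,
    pin_memory=True, num_workers=2)
Remember to call train_sampler.set_epoch(epoch) at the start of each epoch so the shuffling differs across epochs.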

Split torch dataset without shuffling

I'm using PyTorch to run a Transformer model. When I want to split the data (tokenized data), I'm using this code:
train_dataset, test_dataset = torch.utils.data.random_split(
    tokenized_datasets,
    [train_size, test_size])
torch.utils.data.random_split uses shuffling, but I don't want to shuffle; I want to split sequentially.
Any advice? Thanks!
The random_split method has no parameter that can help you create a non-random sequential split.
The easiest way to achieve a sequential split is by directly passing the indices for the subset you want to create:
# Created using indices from 0 to train_size.
train_dataset = torch.utils.data.Subset(tokenized_datasets, range(train_size))
# Created using indices from train_size to train_size + test_size.
test_dataset = torch.utils.data.Subset(tokenized_datasets, range(train_size, train_size + test_size))
Refer to the PyTorch docs.
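These Subset objects plug straight into DataLoader; since the point is a sequential split, leave shuffling off wherever order matters. A quick sketch, with an assumed batch size of 32:
from torch.utils.data import DataLoader

# batch_size=32 is just an assumption for illustration.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)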

Deep learning denoising model with ImageDataGenerator

I have this code:
epochs = 50
batch_size = 5
validation_split = 0.2
datagen = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=validation_split)
train_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='training'
)
val_generator = datagen.flow(
    X_train_noisy, y_train_denoisy, batch_size=batch_size,
    subset='validation'
)
history = model.fit(train_generator,
                    steps_per_epoch=(len(X_train_noisy) * (1 - validation_split)) // batch_size,
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=(len(X_train_noisy) * validation_split) // batch_size)
X_train_noisy and y_train_denoisy are ndarrays of shape [20, 512, 512, 1], for example. But I get this error:
training and validation subsets have different number of classes after the split
How can I solve that?
Thanks!
Probably what happened is that when the data was split for training and validation, the set of files selected for validation did not include any files from one or more of the classes. This can happen when your data set is small. Try increasing validation_split to a larger value, say 0.5, and see if the problem goes away. It should. Then reduce the size of the validation split until the error reoccurs; that will determine the minimum split value you can use. Remember the split is randomized, so set the split value somewhat above that minimum.
Another (better) alternative is to split the data using sklearn's train_test_split. This function has a stratify parameter that splits the data while ensuring that all classes are present in both components. See the code below:
from sklearn.model_selection import train_test_split

X_train_noisy, X_valid_noisy, y_train_denoisy, y_valid_denoisy = train_test_split(
    X_train_noisy, y_train_denoisy, test_size=validation_split,
    shuffle=True, random_state=123, stratify=y_train_denoisy)
Now use these split variables in model.fit.
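For instance, a minimal sketch of how the split arrays could be plugged back in; model, epochs, and batch_size are assumed to be defined as in the question, and the subset= arguments are no longer needed once the split is done up front:
# A plain generator is enough now that the data is already split.
datagen = tf.keras.preprocessing.image.ImageDataGenerator()
train_generator = datagen.flow(X_train_noisy, y_train_denoisy, batch_size=batch_size)
val_generator = datagen.flow(X_valid_noisy, y_valid_denoisy, batch_size=batch_size)

history = model.fit(train_generator,
                    steps_per_epoch=len(X_train_noisy) // batch_size,
                    epochs=epochs,
                    validation_data=val_generator,
                    validation_steps=len(X_valid_noisy) // batch_size)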

How to make train_test_split on a dataset created with make_csv_dataset

For a small dataset, I was using scikit-learn train_test_split on a dataframe of the whole dataset:
from sklearn.model_selection import train_test_split
train, test = train_test_split(features_dataframe, test_size=0.2)
train, test = train_test_split(train, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
And it simply creates test, train, and validation splits of my dataset.
Now I want to perform data loading from disk, i.e., from my CSV files, so I'm using the experimental tf.data function make_csv_dataset. What I have done is:
import tensorflow as tf

defaults = [float()] * len(selected_columns)
data_set = tf.data.experimental.make_csv_dataset(
    file_pattern="./processed/*/*/*.csv",
    column_names=all_columns,         # array with all column labels
    select_columns=selected_columns,  # array with desired column labels
    column_defaults=defaults,         # default column values
    label_name="Target",
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000,
    ignore_errors=True)
As far as I can tell, I have the dataset, but when I try to perform scikit-learn's train_test_split on it, it doesn't work, and the reason is obvious: data_set is not loaded yet, it is just configured to be loaded.
How do I perform a train, test, validation split on this data?
I have gone through some guides, and every one I have come across loads training data that is already split:
overfit_and_underfit
custom_training_walkthrough
estimator
First of all, to have better control over my dataset, I used a similar lower-level API, CsvDataset. Then I manually split the dataset into two different folders for the test and train splits and loaded them separately:
import pathlib

training_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets/path-to-dataset/Train').glob("*/*.csv"))
testing_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets//path-to-dataset/Test').glob("*/*.csv"))

training_dataset = tf.data.experimental.CsvDataset(
    training_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)
print(type(training_dataset))

testing_dataset = tf.data.experimental.CsvDataset(
    testing_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)

print(training_dataset.element_spec)
print(testing_dataset.element_spec)

training_dataset = training_dataset.shuffle(50000)
validate_ds = training_dataset.batch(300).take(100)
train_ds = training_dataset.batch(300, drop_remainder=True).skip(100)
test_ds = testing_dataset.batch(300, drop_remainder=True)
Now it is working, but one problem remains: the validation dataset. Ideally, the validation dataset should be different for each epoch, but in this case it is the same, so training for multiple epochs is not improving performance. If anybody can help resolve this issue, I would be grateful.
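A possible explanation, offered as a hedged note rather than a tested answer: shuffle() reshuffles on every pass by default, so building both pipelines from a freshly shuffled dataset can mix examples between the take() and skip() halves across epochs. Pinning the shuffle order keeps the split fixed and disjoint, and the training half can then be re-shuffled on its own:
# Sketch: pin the shuffle so take()/skip() carve out a fixed, disjoint split.
# Assumes training_dataset and testing_dataset as defined above.
training_dataset = training_dataset.shuffle(50000, seed=42,
                                            reshuffle_each_iteration=False)
validate_ds = training_dataset.batch(300).take(100)
# Chain another shuffle so only the order of training batches varies per epoch.
train_ds = training_dataset.batch(300, drop_remainder=True).skip(100).shuffle(100)
test_ds = testing_dataset.batch(300, drop_remainder=True)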

Balancing an Unbalanced Dataset with K-Fold Cross Validation

I'm trying to train/validate a CNN using PyTorch on an unbalanced image dataset (class 1: 250 images, class 0: about 4,000 images), and right now I've applied augmentation only to my training set (thanks @jodag). However, my model is still learning to favor the class with significantly more images.
I want to find ways to compensate for my unbalanced data set.
I thought about using oversampling/undersampling using the imbalanced data sampler (https://github.com/ufoym/imbalanced-dataset-sampler), but I already use a sampler to select indices for my 5-fold validation. Is there a way I could implement cross-validation using the code below and also add this sampler? Similarly, is there a way to augment one label more frequently than the other? Along the lines of these questions, are there any alternative easier ways that I could address my unbalanced dataset that I haven't looked into yet?
Here's an example of what I have so far:
total_set = datasets.ImageFolder(PATH)
KF_splits = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in KF_splits.split(total_set):
    # Samplers to get indices for cross-validation
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    # Use a wrapper to apply augmentation only to the training set.
    # These are dataloaders that pull images from the same folder but sort
    # them into validation and training sets. Though the transforms augment
    # only the training set, this doesn't address the underlying issue of a
    # heavily unbalanced dataset.
    train_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=ImbalancedDatasetSampler(total_set))
    valid_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32)

    print("Fold:" + str(i))

    for epoch in range(epochs):
        # Train/validate model below
Thank you for your time and help!
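One possible direction, sketched here under stated assumptions rather than as a tested answer: keep the per-fold indices but swap in PyTorch's built-in WeightedRandomSampler, restricted to each fold's training indices, so the minority class is oversampled inside each fold. WrapperDataset and data_transforms are names assumed from the question:
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler, WeightedRandomSampler

labels = np.array([label for _, label in total_set.samples])

for fold, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    # Weight each training example inversely to its class frequency in this fold.
    class_counts = np.bincount(labels[train_idx])
    example_weights = 1.0 / class_counts[labels[train_idx]]

    # Sample with replacement from this fold's training examples only.
    train_subset = Subset(
        WrapperDataset(total_set, transform=data_transforms['train']), train_idx)
    train_sampler = WeightedRandomSampler(
        weights=example_weights, num_samples=len(train_idx), replacement=True)

    train_loader = DataLoader(train_subset, batch_size=32, sampler=train_sampler)
    valid_loader = DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=SubsetRandomSampler(valid_idx))
Because the sampler draws with replacement, each epoch sees roughly balanced classes without touching the validation fold.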

How do I handle unbalanced classes in my classifier?

I am using LinearSVC to classify my documents into categories. However, my dataset is unbalanced, with some categories having 48,000 documents under them and some as small as 100. When I train my model, even using Stratified KFold, I see that the category with 48,000 documents gets a larger portion of the documents (3,300) compared to the others. In such a case, it would definitely give me biased predictions. How can I make sure this selection isn't biased?
kf = StratifiedKFold(labels, n_folds=10, shuffle=True)
for train_index, test_index in kf:
    X_train, X_test = docs[train_index], docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
Then I'm writing these (X_train, Y_train) to a file, computing the feature matrix, and passing them to the classifier as follows:
model1 = LinearSVC()
model1 = model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)
print("Accuracy is:")
print(metrics.accuracy_score(label_test, pred))
print(metrics.classification_report(label_test, pred))
The StratifiedKFold method by default takes into account the ratio of labels in all your classes, meaning that each fold will have the exact (or close to exact) ratio of each label in that sample. Whether you want to adjust for this or not is somewhat up to you - you can either let the classifier learn some kind of bias for labels with more samples (as you are now), or you can do one of two things:
Construct a separate train / test set, where the training set has equal number of samples in each label (therefore in your case, each class label in the training set might only have 50 examples, which is not ideal). Then you can train on your training set and test on the rest. If you do this multiple times with different samples, you are essentially doing k-fold cross validation, just choosing your sample sizes in a different way.
You can change your loss function (i.e., the way you initialize LinearSVC()) to account for the class imbalances. For example: model = LinearSVC(class_weight='balanced'). This will cause the model to learn a loss function that takes class imbalances into account.
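Concretely, the second option drops straight into the code from the question (variable names as above):
from sklearn.svm import LinearSVC

# class_weight='balanced' reweights each class inversely to its frequency,
# so the 48,000-document category no longer dominates the loss.
model1 = LinearSVC(class_weight='balanced')
model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)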
