I have multiple datasets, each with a different number of images (and different image dimensions) in it. In the training loop I want to load a batch of images randomly from among all the datasets but so that each batch only contains images from a single dataset. For example, I have datasets A, B, C, D and each has images 01.jpg, 02.jpg, … n.jpg (where n depends on the dataset), and let’s say the batch size is 3. In the first loaded batch, for example, I may get images [B/02.jpg, B/06.jpg, B/12.jpg], in the next batch [D/01.jpg, D/05.jpg, D/12.jpg], etc.
So far I have considered the following:
Use a different DataLoader for each dataset, e.g. dataloaderA, dataloaderB, etc., and then in each training iteration randomly select one of the dataloaders and get a batch from it. However, this requires a for loop, and for a large number of datasets it would be very slow, since the work can't be split among workers to run in parallel.
Use a single DataLoader with all of the images from all datasets together but with a custom collate_fn which will create a batch using only images from the same dataset. (I’m not sure how exactly to go about this.)
I have looked at the ConcatDataset class, but from its source code it looks like if I use it and ask for a new batch, the images in it will be mixed from different datasets, which I don't want.
What would be the best way to do this? Thanks!
You can use ConcatDataset, and provide a batch_sampler to DataLoader.
concat_dataset = ConcatDataset((dataset1, dataset2))
ConcatDataset.cumulative_sizes will give you the boundaries between each dataset you have:
ds_indices = concat_dataset.cumulative_sizes
Now, you can use ds_indices to create a batch sampler. See the source for BatchSampler for reference. Your batch sampler just has to yield lists of N random indices that respect the ds_indices boundaries, which guarantees that every batch contains elements from the same dataset.
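A minimal sketch of such a batch sampler, assuming dataset1 and dataset2 from the snippet above (the class name and batch size are illustrative):

import random
from torch.utils.data import ConcatDataset, DataLoader, Sampler

class SingleDatasetBatchSampler(Sampler):
    """Yields batches of indices that all come from the same underlying dataset."""
    def __init__(self, cumulative_sizes, batch_size):
        self.batch_size = batch_size
        # Turn the cumulative sizes into per-dataset index ranges.
        starts = [0] + cumulative_sizes[:-1]
        self.index_ranges = [list(range(s, e)) for s, e in zip(starts, cumulative_sizes)]

    def __iter__(self):
        batches = []
        for indices in self.index_ranges:
            shuffled = indices[:]
            random.shuffle(shuffled)
            # Chop each dataset's shuffled indices into batch-sized chunks.
            batches.extend(shuffled[i:i + self.batch_size]
                           for i in range(0, len(shuffled), self.batch_size))
        random.shuffle(batches)  # mix the order of batches across datasets
        return iter(batches)

    def __len__(self):
        return sum((len(r) + self.batch_size - 1) // self.batch_size
                   for r in self.index_ranges)

concat_dataset = ConcatDataset((dataset1, dataset2))
batch_sampler = SingleDatasetBatchSampler(concat_dataset.cumulative_sizes, batch_size=3)
loader = DataLoader(concat_dataset, batch_sampler=batch_sampler)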
I am building an LSTM model in which samples have different numbers of time steps. I want to optimise my code and performance, so I do not want to use masking; instead I want to write a generator that will automatically group batches by number of steps. My idea is the following:
For each possible sequence length (ranging from 1-365), create a TFRecord dataset that contains only the samples of that length.
In each generator loop, randomly choose a sequence length and take a batch of data from the corresponding TFRecord dataset. One option is to read batches from this TFRecord dataset until it is depleted - this is preferable if it is costly to open and close a TFRecord dataset multiple times.
Otherwise, if it is not costly to open and close a TFRecord dataset and read from the middle of it, we can randomly choose a sequence length for each batch (this sounds more robust).
I was able to implement this logic with .csvs (i.e. one csv per sequence length) following this example, https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly, but I am wondering if I could gain more performance by doing the same with TFRecords. However, I couldn't find any resources that would show me how to use them with this degree of flexibility.
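A rough sketch of what I have in mind with TFRecords - the file naming scheme and parse_example() are placeholders for the real feature spec:

import random
import tensorflow as tf

def parse_example(serialized):
    # Placeholder feature spec - replace with the real one.
    feats = tf.io.parse_single_example(serialized, {
        "inputs": tf.io.VarLenFeature(tf.float32),
        "label": tf.io.FixedLenFeature([], tf.float32)})
    return tf.sparse.to_dense(feats["inputs"]), feats["label"]

def length_bucketed_batches(batch_size=32):
    while True:
        seq_len = random.randint(1, 365)                  # pick a sequence length per batch
        path = "seq_len_%03d.tfrecord" % seq_len          # one file per length (assumed naming)
        ds = (tf.data.TFRecordDataset(path)
              .map(parse_example)
              .shuffle(1000)
              .batch(batch_size))
        for batch in ds.take(1):                          # one batch, or loop until depleted
            yield batch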
Can anybody point me in the right direction here?
Thanks!
I am working with time series datasets where I have two different cases: one where my sequences are all of the same size, and one where the sequences are of different lengths. When I have same-length sequences, I can merge all the datasets and then fit the model once.
But for different-length sequences, I was wondering how differently Keras model.fit will behave
if the model is fitted with each different-length dataset one by one, with batch size = length of the sequence
if the model is fitted once with all the sequences merged together, with a fixed batch size
And given these scenarios, what would be the correct or better course of action?
In the first scenario, the weights will be optimized first on the first dataset and then changed (updated) for the second dataset, and so on. In the second scenario, you are asking the model to learn patterns from all the datasets simultaneously, which means the weights are adjusted according to all the datasets at once. I would prefer the second approach, because neural networks have a tendency to forget/break when trained on new datasets: they are more likely to focus on the data they have seen most recently.
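A minimal sketch of the two scenarios, assuming hypothetical random data and zero-padding to merge the variable-length sequences (the model and arrays stand in for the real ones, purely for illustration):

import numpy as np
from tensorflow import keras

n_features = 4
# Hypothetical datasets with different sequence lengths (stand-ins for the real data).
datasets = [(np.random.rand(50, length, n_features).astype("float32"),
             np.random.rand(50).astype("float32")) for length in (10, 20, 30)]

model = keras.Sequential([
    keras.layers.LSTM(16, input_shape=(None, n_features)),  # None = any number of time steps
    keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Scenario 1: fit dataset by dataset; the weights learned on earlier datasets are only the
# starting point for the next fit, so the most recent dataset dominates.
for X, y in datasets:
    model.fit(X, y, batch_size=len(X), epochs=2, verbose=0)

# Scenario 2: merge everything (here by zero-padding to a common length) and fit once,
# so every gradient step sees a mix of all datasets.
# (In practice you would compare the scenarios on separately initialized models.)
max_len = max(X.shape[1] for X, _ in datasets)
X_all = np.concatenate([np.pad(X, ((0, 0), (0, max_len - X.shape[1]), (0, 0)))
                        for X, _ in datasets])
y_all = np.concatenate([y for _, y in datasets])
model.fit(X_all, y_all, batch_size=32, epochs=2, verbose=0)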
I am training a deep learning model using Mask RCNN from the following git repository: matterport/Mask_RCNN. I rely on heavy augmentation of my dataset (original dataset: 59 images of 1988x1355x3, each with > 80 annotations), which I store locally (necessary to evaluate the type/degree of augmentation against validation metrics). The augmented dataset contains 6000 images. These images vary in their x and y dimensions because of resolution reduction and affine transformations - I assume the different x,y-dimensions will not affect the final tests.
However, my Python kernel crashes whenever I load more than 'X' images to train the model.
Hence, I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for each new round. But I am not sure whether the results will be the same (read: the same, taking the stochastic nature of 'stochastic gradient descent' into account)?
I also wonder whether the results would be the same if I don't iterate through the sub-datasets every epoch, but instead train Y epochs on each (e.g. 20 for 'heads' only, 10 for 'all layers')?
Yet, I am sure this is not the most efficient way of solving this issue. Ideas for improvement are welcome.
Note that I am not using keras.preprocessing.image.ImageDataGenerator(); as I understand it, it randomly generates data and feeds it to the model, replacing the input for that epoch, whereas I would like to feed the whole dataset to the model.
I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for each new round. But I am not sure if the results will be the same?
You are doing the same thing ImageDataGenerator does (creating your own mini-batches), only less optimally. The same result with respect to what?
If you mean with respect to a model that was trained with all the data in a single batch - most probably not, since a smaller batch means slower convergence. But this can be compensated for by training for more epochs.
Another issue is reproducibility. If you want to reproduce your model with the same results each time, just set seeds:
import random
random.seed(1)                   # Python's built-in RNG
import numpy as np
np.random.seed(1)                # NumPy RNG
import tensorflow
tensorflow.random.set_seed(1)    # TensorFlow RNG (TF 2.x API)
Another concept is gradient accumulation. It will help you train with a high effective batch size without keeping too many images in memory at a time.
https://github.com/CyberZHG/keras-gradient-accumulation
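The linked library packages this up for Keras optimizers; the general idea, sketched here with a plain tf.GradientTape loop and a tiny stand-in model and dataset (the accumulation factor and batch sizes are illustrative), is:

import tensorflow as tf

# Tiny stand-ins for the real model and dataset, just so the sketch runs.
model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28)),
                             tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((256, 28, 28)),
     tf.random.uniform((256,), maxval=10, dtype=tf.int32))).batch(8)

ACCUM_STEPS = 4                      # effective batch size = 4 * 8 = 32
accum = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (images, labels) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    accum = [a + g for a, g in zip(accum, grads)]      # accumulate small-batch gradients
    if (step + 1) % ACCUM_STEPS == 0:                  # apply one update every ACCUM_STEPS batches
        optimizer.apply_gradients(zip([a / ACCUM_STEPS for a in accum],
                                      model.trainable_variables))
        accum = [tf.zeros_like(v) for v in model.trainable_variables]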
Finally, keras.preprocessing.image.ImageDataGenerator() in fact trains on the whole dataset, it just chooses a random sample at each step (you're doing the same thing with your so-called sub-datasets).
You can seed the ImageDataGenerator so it is reproducible and not entirely random.
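For example (a minimal sketch; the arrays and augmentation parameters are placeholders):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.rand(100, 64, 64, 3)     # placeholder images
y_train = np.random.randint(0, 2, 100)       # placeholder labels

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
# Fixing the seed makes the stream of augmented batches reproducible across runs;
# the resulting iterator can be passed straight to model.fit.
batches = datagen.flow(x_train, y_train, batch_size=32, seed=1)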
I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
    lbl: int,
    crop_0: np.ndarray,
    crop_1: np.ndarray,
    crop_0_flipped: np.ndarray,
    crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N records become 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each record, use only the crop_0 and crop_1 images, and average the softmax outputs for classification.
My question is - what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if my current one makes training inefficient. It seems the simplest thing to do would have been to have separate tfrecords files for each version (crop & flip/flop images) and interleave the files into one dataset, but I do not want to have a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach that you'll come to loathe later (I did it this way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk; do that kind of preprocessing on the CPU while you train. The tensorflow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to tensorflow using feed_dict, instead, tensorflow will just invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply some set of transformations, with the num_parallel_calls argument to distribute the operations across multiple cores. If your preprocessing can be done in tensorflow code, great; if not, you'll need to use tf.py_func to call python preprocessing code.
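A rough sketch of what the training-side pipeline could look like, assuming the record layout above (the file name, crop shape, and decoding details are guesses - adjust them to how the records were actually written):

import tensorflow as tf

IMG_SHAPE = (128, 128, 3)                     # assumed crop size
CROP_KEYS = ["crop_0", "crop_1", "crop_0_flipped", "crop_1_flipped"]

def parse_record(serialized):
    # Feature names follow the record layout above; dtypes/shapes are assumptions.
    spec = {"lbl": tf.io.FixedLenFeature([], tf.int64)}
    spec.update({k: tf.io.FixedLenFeature([], tf.string) for k in CROP_KEYS})
    feats = tf.io.parse_single_example(serialized, spec)
    images = [tf.reshape(tf.io.decode_raw(feats[k], tf.uint8), IMG_SHAPE)
              for k in CROP_KEYS]
    labels = tf.fill([len(CROP_KEYS)], feats["lbl"])
    return tf.stack(images), labels            # 4 images sharing one label

train_ds = (tf.data.TFRecordDataset("train.tfrecords")
            .map(parse_record, num_parallel_calls=4)        # parallel CPU preprocessing
            .flat_map(lambda imgs, lbls:                    # 1 record -> 4 separate examples
                      tf.data.Dataset.from_tensor_slices((imgs, lbls)))
            .shuffle(10000)
            .batch(32)
            .prefetch(1))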
The guide I linked to above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This will allow you to get a string ID from each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict, indicating which of the two datasets tensorflow should pull samples from.
I would like to double the size of an existing dataset I'm using to train a neural network in tensorflow, on the fly, by adding random noise to it. So when I'm done I'll have all the existing examples and also all the examples with noise added to them. I'd also like to interleave these as I transform them, so they come out in this order: example 1 without noise, example 1 with noise, example 2 without noise, example 2 with noise, etc. I'm struggling to accomplish this using the Dataset api. I've tried using unbatch, like so:
def generate_permutations(features, labels):
    return [
        [features, labels],
        [add_noise(features), labels]
    ]

dataset.map(generate_permutations).apply(tf.contrib.data.unbatch())
but I get an error saying Shapes must be equal rank, but are 2 and 1. I'm guessing tensorflow is trying to make a tensor out of the batch I'm returning, but features and labels have different shapes, so that doesn't work. I could probably do this by just making two datasets and concatenating them together, but I'm worried that would result in very skewed training, where I train nicely for half the epoch and then suddenly all of the data has this new transformation for the second half. How can I accomplish this on the fly, without writing these transformations to disk before feeding them into tensorflow?
The Dataset.flat_map() transformation is the tool you need: it enables you to map a single input element into multiple elements, then flattens the result. Your code would look something like the following:
def generate_permutations(features, labels):
    regular_ds = tf.data.Dataset.from_tensors((features, labels))
    noisy_ds = tf.data.Dataset.from_tensors((add_noise(features), labels))
    return regular_ds.concatenate(noisy_ds)

dataset = dataset.flat_map(generate_permutations)
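Because each original element is immediately followed by its noisy copy, the output order is exactly the interleaving you asked for, and nothing is written to disk. If add_noise isn't written yet, something as simple as additive Gaussian noise will do for a sanity check (the noise scale here is arbitrary):

def add_noise(features, stddev=0.1):
    # Placeholder implementation: additive Gaussian noise with an arbitrary scale.
    # (Use tf.random_normal instead of tf.random.normal on older TF 1.x versions.)
    return features + tf.random.normal(tf.shape(features), stddev=stddev)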