I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is: what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if the current one makes training inefficient. The simplest thing would seem to be to keep separate tfrecords files for each version (crop and flip/flop) and interleave the files into one dataset, but I don't want a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach you'll come to loathe later (I did it this way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk. Do that kind of preprocessing on the CPU while you train. The tensorflow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding cost.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data itself to tensorflow using feed_dict; instead, tensorflow will invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply a set of transformations, with the num_parallel_calls argument distributing the operations across multiple cores. If your preprocessing can be done in tensorflow code, great; if not, you'll need tf.py_func to wrap python preprocessing code.
The guide I linked to above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". It lets you get a string handle for each of the 2 datasets (train and test) and pass that string to tensorflow via feed_dict to indicate which of the two datasets tensorflow should pull samples from.
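To make this concrete, here is a minimal sketch of such a pair of pipelines in TF 1.x style. The feature names ('lbl', 'image'), the stored image size, and the crop size are my assumptions; adjust them to match how your TFRecords were actually written:

import tensorflow as tf

H, W = 256, 256  # assumed size of the stored, un-cropped images

def parse_record(serialized):
    # Feature names here are assumptions; match them to your writer code.
    features = tf.parse_single_example(serialized, {
        'lbl': tf.FixedLenFeature([], tf.int64),
        'image': tf.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [H, W, 3])
    return image, features['lbl']

def augment(image, label):
    # Random crop and flip computed on the fly; nothing extra on disk.
    image = tf.random_crop(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image, label

def center_crop(image, label):
    # Deterministic crop so test batches have the same shape as training.
    return tf.image.resize_image_with_crop_or_pad(image, 224, 224), label

train_ds = (tf.data.TFRecordDataset('train.tfrecords')
            .map(parse_record, num_parallel_calls=4)
            .map(augment, num_parallel_calls=4)
            .shuffle(buffer_size=10000)
            .batch(32)
            .repeat())

test_ds = (tf.data.TFRecordDataset('test.tfrecords')
           .map(parse_record, num_parallel_calls=4)
           .map(center_crop, num_parallel_calls=4)
           .batch(32))

# Feedable iterator: one string handle per dataset, switched via feed_dict.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_ds.output_types, train_ds.output_shapes)
images, labels = iterator.get_next()

train_iter = train_ds.make_one_shot_iterator()
test_iter = test_ds.make_initializable_iterator()

with tf.Session() as sess:
    train_handle = sess.run(train_iter.string_handle())
    # sess.run(train_op, feed_dict={handle: train_handle})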
Related
I am currently implementing a machine learning model which uses a rather heavy representation of data.
My dataset is composed of images. Each of these images is encoded into a (224, 224, 103) matrix, making the entire dataset very heavy. I store these matrices on disk and load them during training.
What I'm doing right now is using mini-batches of 8 images and loading the .npy files for those 8 images from disk throughout the entire training process. This is slow, but it works.
Is there a more efficient way to do it using Keras/Tensorflow (which is what I'm using to code my model)?
I unfortunately couldn't find much about a dataloader that would allow me to do this.
Thanks in advance.
You have several options to do this.
I will assume that the transformations you apply to the images to get the final (224, 224, 103) matrix are very expensive, and that it's not desirable to redo that pre-processing at data-loading time. If this is not the case, you might benefit from reading the tutorial relevant to image processing.
I suggest you use a python generator to read the data and tf.data to create a data pipeline that feeds these .npy files to your model. The basic idea is very simple: you use a wrapper to ingest data from a generator that reads the files as needed. The relevant documentation and examples are here.
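A minimal sketch of that wrapper (the file locations, the label storage, and the batch size of 8 from your setup are assumptions on my part):

import glob
import numpy as np
import tensorflow as tf

npy_files = sorted(glob.glob('data/*.npy'))  # assumed location of the matrices
labels = np.load('labels.npy')               # assumed label array, one per file

def npy_generator():
    # Files are read lazily, one sample at a time, so only a few
    # batches are ever held in memory.
    for path, label in zip(npy_files, labels):
        yield np.load(path).astype(np.float32), label

dataset = (tf.data.Dataset.from_generator(
               npy_generator,
               output_types=(tf.float32, tf.int64),
               output_shapes=((224, 224, 103), ()))
           .shuffle(64)
           .batch(8)
           .prefetch(1))  # load the next batch while the current one trains

# A compiled tf.keras model can consume the Dataset directly, e.g.:
# model.fit(dataset, epochs=10, steps_per_epoch=len(npy_files) // 8)

The prefetch call is what actually overlaps disk reads with computation; without it the generator is polled synchronously.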
Now, once you get that working, I think it would be a good idea for you to optimize your pipeline, especially if you're planning to train in multiple GPUs or multiple computers.
I have been studying transfer learning with models like inception_v4 and inception_resnet_v2. I found some projects that use bottlenecks and some that use tfrecords to store the training images. When retraining the inception_v4 model on the same data with those two methods, the bottleneck approach gave 95% accuracy and tfrecords gave only 75%. But all the new projects seem to use tfrecords for data and the .ckpt format to store the model. Can someone explain to me what the difference is and which one is better in which case?
If you are working with large datasets, using a binary file format to store your data can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model.
By using TFRecords, it is possible to store sequence data, for example a time series. Besides, it is easy to combine multiple datasets, and the format integrates seamlessly with the data import and preprocessing functionality provided by the library.
For more information about TFRecords, please refer to this link.
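As a minimal sketch of the write side (TF 1.x API; the array, label, and file name are placeholders):

import numpy as np
import tensorflow as tf

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder image
label = 1                                        # placeholder label

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.python_io.TFRecordWriter('train.tfrecords') as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': _int64_feature(label),
        'image': _bytes_feature(image.tostring()),
    }))
    writer.write(example.SerializeToString())

Reading it back goes through tf.data.TFRecordDataset plus tf.parse_single_example, which is how it plugs into the input pipeline.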
I have multiple datasets, each with a different number of images (and different image dimensions) in it. In the training loop I want to load a batch of images randomly from among all the datasets but so that each batch only contains images from a single dataset. For example, I have datasets A, B, C, D and each has images 01.jpg, 02.jpg, … n.jpg (where n depends on the dataset), and let’s say the batch size is 3. In the first loaded batch, for example, I may get images [B/02.jpg, B/06.jpg, B/12.jpg], in the next batch [D/01.jpg, D/05.jpg, D/12.jpg], etc.
So far I have considered the following:
Use a different DataLoader for each dataset, e.g. dataloaderA, dataloaderB, etc., and then in each training iteration randomly select one of the dataloaders and get a batch from it. However, this requires a for loop, and for a large number of datasets it would be very slow, since the work can't be split among workers to run in parallel.
Use a single DataLoader with all of the images from all datasets together but with a custom collate_fn which will create a batch using only images from the same dataset. (I’m not sure how exactly to go about this.)
I have looked at the ConcatDataset class, but from its source code it looks like if I use it and try getting a new batch, the images in it will be mixed up from different datasets, which I don't want.
What would be the best way to do this? Thanks!
You can use ConcatDataset, and provide a batch_sampler to DataLoader.
concat_dataset = ConcatDataset((dataset1, dataset2))
ConcatDataset.cumulative_sizes will give you the boundaries between each dataset you have:
ds_indices = concat_dataset.cumulative_sizes
Now, you can use ds_indices to create a batch sampler. See the source for BatchSampler for reference. Your batch sampler just has to return a list with N random indices that respect the ds_indices boundaries, which guarantees that all elements of a batch come from the same dataset.
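A minimal sketch of such a sampler (the class name is mine; it reuses concat_dataset from above):

import random
from torch.utils.data import DataLoader, Sampler

class PerDatasetBatchSampler(Sampler):
    # Yields lists of indices that all fall inside one constituent dataset.
    def __init__(self, concat_dataset, batch_size):
        self.batch_size = batch_size
        bounds = [0] + concat_dataset.cumulative_sizes
        self.ranges = [range(lo, hi) for lo, hi in zip(bounds, bounds[1:])]

    def __iter__(self):
        batches = []
        for r in self.ranges:
            idxs = list(r)
            random.shuffle(idxs)
            batches += [idxs[i:i + self.batch_size]
                        for i in range(0, len(idxs), self.batch_size)]
        random.shuffle(batches)  # mix batches across datasets, never within one
        return iter(batches)

    def __len__(self):
        return sum((len(r) + self.batch_size - 1) // self.batch_size
                   for r in self.ranges)

loader = DataLoader(concat_dataset,
                    batch_sampler=PerDatasetBatchSampler(concat_dataset,
                                                         batch_size=3))

Note that when a dataset's size is not a multiple of the batch size, its last batch will simply be smaller; drop it in __iter__ if you need fixed-size batches.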
When and how should I use fit_generator?
What is the difference between fit and fit_generator?
If you have prepared your data and labels in all necessary aspects and can simply assign these to arrays x and y, then use model.fit(x, y).
If you need to preprocess and/or augment your data while training, then you can take advantage of the generators that Keras provides.
You could, for example, augment images by applying random transforms (very helpful if you only have little data to train with), pad sequences, tokenize text, let Keras automagically read your data from a folder and assign appropriate classes (flow_from_directory), and much more.
See here for examples and boilerplate code for image preprocessing: https://keras.io/preprocessing/image/
or here for text preprocessing:
https://keras.io/preprocessing/text/
fit_generator will also help you train in a more memory-efficient way, since you load data only when needed. The generator function yields (i.e. "delivers") data to your model batch by batch, on demand. For example:
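A minimal example with image data (the directory path and parameter values are placeholders, and model is assumed to be a compiled Keras model):

from keras.preprocessing.image import ImageDataGenerator

# Random transforms are applied freshly every time a batch is drawn.
datagen = ImageDataGenerator(rotation_range=20,
                             horizontal_flip=True,
                             rescale=1. / 255)

# flow_from_directory infers the class labels from the folder structure.
train_gen = datagen.flow_from_directory('data/train',
                                        target_size=(224, 224),
                                        batch_size=32,
                                        class_mode='categorical')

model.fit_generator(train_gen,
                    steps_per_epoch=len(train_gen),
                    epochs=10)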
They are useful for on-the-fly augmentation, as the previous poster mentioned. This is not necessarily restricted to generators, though, because you can fit for one epoch, then augment your data and fit again.
What does not work with fit is data that is too large for memory. If you have a 1 TB dataset and only 8 GB of RAM, you can use the generator to load the data on the fly and hold only a couple of batches in memory. This helps tremendously when scaling to huge datasets.
I'd like to compare the performance of the following types of CNNs on two different large image data sets. The goal is to measure the similarity between two images, neither of which was seen during training. I have access to 2 GPUs and 16 CPU cores.
Triplet CNN (Input: Three images, Label: encoded in position)
Siamese CNN (Input: Two images, Label: one binary label)
Softmax CNN for Feature Learning (Input: One image, Label: one integer label)
For Softmax I can store the data in a binary format (sequentially storing label and image) and then read it with a TensorFlow reader.
To use the same method for the Triplet and Siamese networks, I'd have to generate the combinations in advance and store them to disk. That would result in big overhead, in both the time it takes to create the file and in disk space. How can it be done on the fly?
Another easy way would be to use feed_dict, but this would be slow. The problem would therefore be solved if it were possible to run the same function I'd use for feed_dict in parallel and convert the result to a TensorFlow tensor as a last step. But as far as I know such a conversion does not exist, so one has to read the files with a TensorFlow reader in the first place and do the whole process with TensorFlow methods. Is this correct?
Short answer: do the pair/triplet creation online with NumPy; there is no need to convert the result to a tensor, since the feed_dict argument accepts NumPy arrays already.
The best would be to use tf.nn.embedding_lookup() on already-existing batches, in combination with itertools to create the indices of the pairs, but for a naïve, non-optimal solution you can look at the gen_batches_siamese.py script in my github repository, where I reimplemented the caffe siamese example.
Obviously it will be less efficient than using tensorflow queues, but my advice would be to try this baseline first before moving to a pure tensorflow solution. A sketch of the idea is below.
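For illustration, here is a minimal NumPy sketch of online pair creation (all names are hypothetical, not taken from the script above; the placeholders in the commented lines are assumed to exist in your graph):

import numpy as np

rng = np.random.RandomState(0)

def make_pairs(images, labels, num_pairs):
    # Sample roughly balanced positive/negative pairs on the fly;
    # nothing is ever written to disk.
    by_class = {}
    for i, lbl in enumerate(labels):
        by_class.setdefault(lbl, []).append(i)
    classes = list(by_class)
    lefts, rights, targets = [], [], []
    for _ in range(num_pairs):
        if rng.rand() < 0.5:  # positive pair (same class, may repeat an image)
            c = classes[rng.randint(len(classes))]
            i, j = rng.choice(by_class[c], size=2)
            targets.append(1)
        else:                 # negative pair (two distinct classes)
            c1, c2 = rng.choice(len(classes), size=2, replace=False)
            i = rng.choice(by_class[classes[c1]])
            j = rng.choice(by_class[classes[c2]])
            targets.append(0)
        lefts.append(i)
        rights.append(j)
    return images[lefts], images[rights], np.asarray(targets)

# Per training step, feed the freshly sampled arrays straight into feed_dict:
# left, right, y = make_pairs(train_images, train_labels, num_pairs=32)
# sess.run(train_op, feed_dict={left_ph: left, right_ph: right, y_ph: y})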