Appending large image dataset to an array - python

I am doing classification of fake images using a CNN. My data contains 100K+ images across two classes. I'm working in Google Colab and have already increased the RAM to 25 GB, but the session keeps crashing while I append the images to an array; the most I can append is around 16K images. Would it be better to work in smaller groups and then combine the results and take the average to get the accuracy, etc.?
Is there any advice or solution I can try for this?

Related

Clustering small image set with unknown number of clusters

I am working on a project where I am trying to cluster several small image sets (each set has around 100 images) without knowing how many classes each set contains (it can range from 2 to 4). I have tried using a convolutional autoencoder for feature extraction and dimensionality reduction and then running HDBSCAN on the reduced images, but to no avail.
Does anyone have a suggestion for which feature extraction/clustering algorithm I can go with?

TensorFlow tf.data.Dataset API for medical imaging

I'm a student in medical imaging. I have to construct a neural network for image segmentation. I have a data set of 285 subjects, each with 4 modalities (T1, T2, T1ce, FLAIR) plus their respective segmentation ground truth. Everything is in 3D with a resolution of 240x240x155 voxels (this is the BraTS data set).
As we know, I cannot feed the whole image to a GPU for memory reasons. I have to preprocess the images and decompose them into overlapping 3D patches (sub-volumes of 40x40x40), which I do with scikit-image's view_as_windows, and then serialize the windows into a TFRecords file. Since the patches are extracted every 10 voxels in each direction (so neighbouring patches overlap), this amounts to 5,292 patches per volume. The problem is that, with only 1 modality, the TFRecords file comes out at around 800 GB. On top of that, I have to compute the corresponding segmentation weight map and store it as patches too, and the segmentation itself is also stored as patches in the same file.
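For reference, the patch extraction looks roughly like this (a simplified sketch: the zero array is just a placeholder for one real modality, and the 10-voxel step is what yields 5,292 patches for a 240x240x155 volume):

import numpy as np
from skimage.util import view_as_windows

volume = np.zeros((240, 240, 155), dtype=np.float32)  # placeholder for one modality
# 40x40x40 windows, moved by 10 voxels along each axis
patches = view_as_windows(volume, window_shape=(40, 40, 40), step=10)
patches = patches.reshape(-1, 40, 40, 40)
print(patches.shape[0])  # 21 * 21 * 12 = 5292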
Eventually I will have to include all the other modalities, which would take nothing less than terabytes of storage. I also have to remember to sample an equivalent number of patches from background and foreground (class balancing).
So I guess I have to do all the preprocessing steps on the fly, just before every training step (while hoping not to slow down training too much). I cannot use tf.data.Dataset.from_tensors() since I cannot load everything in RAM. I cannot use tf.data.TFRecordDataset() with pre-extracted patches either, since preprocessing the whole thing beforehand takes a lot of storage and I will eventually run out.
The question is: what is left for me to do this cleanly, with the possibility of reloading the model after training for image inference?
Thank you very much and feel free to ask for any other details.
Pierre-Luc
Finally, I found a method to solve my problem.
I first determine the crop of a subject's image without actually applying it: I only measure the slices needed to crop the volume down to the brain. I then serialize the whole data set's images into one TFRecord file, each training example being an image modality, the original image's shape, and the slices (saved as an Int64 feature).
I decode the TFRecords afterward. Each training example is reshaped to the shape stored in its feature. I stack all the image modalities together using the tf.stack() method, crop the stack using the previously extracted slices (the crop is then applied to all images in the stack), and finally extract random patches using the tf.random_crop() method, which lets me randomly crop a 4-D array (height, width, depth, channel).
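In code, the stack / crop / random-patch step looks roughly like this (a simplified sketch; the slice layout, patch size and number of modalities shown here are illustrative, not my exact feature schema):

import tensorflow as tf

def make_training_patch(modalities, slices, patch_size=(40, 40, 40)):
    # modalities: list of decoded 3-D volumes (one tensor per modality)
    # slices: int64 tensor [z0, z1, y0, y1, x0, x1] measured at serialization time
    stack = tf.stack(modalities, axis=-1)              # (height, width, depth, channel)
    s = tf.cast(slices, tf.int32)
    brain = stack[s[0]:s[1], s[2]:s[3], s[4]:s[5], :]  # apply the stored crop to every modality at once
    patch = tf.random_crop(brain, list(patch_size) + [len(modalities)])
    return patch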
The only thing I still haven't figured out is data augmentation. Since all of this happens on Tensors, I cannot use plain Python and NumPy to rotate, shear, or flip a 4-D array. I would need to do it inside the tf.Session(), but I would rather avoid that and directly feed the training handle.
For evaluation, I serialize only one test subject per TFRecords file. The test subject contains all modalities too, but since there are no TensorFlow methods to extract patches in 4-D, the image is preprocessed into small patches using scikit-learn's extract_patches() method, and these patches are serialized to the TFRecords.
This way, the training TFRecords file is a lot smaller, and I can evaluate the test data using batch prediction.
Thanks for reading and feel free to comment !

PyTorch data loading from multiple different-sized datasets

I have multiple datasets, each with a different number of images (and different image dimensions) in it. In the training loop I want to load a batch of images randomly from among all the datasets but so that each batch only contains images from a single dataset. For example, I have datasets A, B, C, D and each has images 01.jpg, 02.jpg, … n.jpg (where n depends on the dataset), and let’s say the batch size is 3. In the first loaded batch, for example, I may get images [B/02.jpg, B/06.jpg, B/12.jpg], in the next batch [D/01.jpg, D/05.jpg, D/12.jpg], etc.
So far I have considered the following:
Use a different DataLoader for each dataset, e.g. dataloaderA, dataloaderB, etc., and then in each training loop randomly select one of the dataloaders and get a batch from it. However, this requires a for loop, and for a large number of datasets it would be very slow, since it can't be split among workers to run in parallel.
Use a single DataLoader with all of the images from all datasets together but with a custom collate_fn which will create a batch using only images from the same dataset. (I’m not sure how exactly to go about this.)
I have looked at the ConcatDataset class, but from its source code it looks like if I use it and request a new batch, the images in it will be mixed up from different datasets, which I don't want.
What would be the best way to do this? Thanks!
You can use ConcatDataset, and provide a batch_sampler to DataLoader.
concat_dataset = ConcatDataset((dataset1, dataset2))
ConcatDataset.cumulative_sizes will give you the boundaries between each dataset you have:
ds_indices = concat_dataset.cumulative_sizes
Now, you can use ds_indices to create a batch sampler. See the source for BatchSampler for reference. Your batch sampler just has to return a list of N random indices that respect the ds_indices boundaries. This guarantees that all elements of a batch come from the same dataset.
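A rough sketch of such a batch sampler (the number of batches per epoch and the "one random dataset per batch" policy are just one possible choice):

import random
from torch.utils.data import DataLoader, Sampler

class SingleDatasetBatchSampler(Sampler):
    def __init__(self, cumulative_sizes, batch_size, num_batches):
        self.batch_size = batch_size
        self.num_batches = num_batches
        # Turn cumulative sizes into (start, end) index ranges, one per dataset
        starts = [0] + cumulative_sizes[:-1]
        self.ranges = list(zip(starts, cumulative_sizes))

    def __iter__(self):
        for _ in range(self.num_batches):
            start, end = random.choice(self.ranges)                   # pick one dataset
            yield random.sample(range(start, end), self.batch_size)   # indices from that dataset only

    def __len__(self):
        return self.num_batches

batch_sampler = SingleDatasetBatchSampler(ds_indices, batch_size=3, num_batches=100)
loader = DataLoader(concat_dataset, batch_sampler=batch_sampler)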

Optimizing shuffle buffer size in tensorflow dataset api

I'm trying to use the Dataset API to load data and find that I'm spending a majority of the time loading data into the shuffle buffer. How might I optimize this pipeline in order to minimize the amount of time spent populating the shuffle buffer?
(tf.data.Dataset.list_files(path)
    .shuffle(num_files)  # number of tfrecord files
    .apply(tf.contrib.data.parallel_interleave(
        lambda f: tf.data.TFRecordDataset(f), cycle_length=num_files))
    .shuffle(num_items)  # number of images in the dataset
    .map(parse_func, num_parallel_calls=8)
    .map(get_patches, num_parallel_calls=8)
    .apply(tf.contrib.data.unbatch())
    # Patch buffer is currently the number of patches extracted per image
    .apply(tf.contrib.data.shuffle_and_repeat(patch_buffer))
    .batch(64)
    .prefetch(1)
    .make_one_shot_iterator())
Since I have at most thousands of images, my solution to this problem was to have a separate tfrecord file per image. That way, individual images can be shuffled without having to load them into memory first. This drastically reduced the amount of buffering that needed to occur.
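With one TFRecord per image, the front of the pipeline then looks roughly like this (a simplified sketch; the file pattern, cycle_length and buffer sizes are placeholders, and parse_func is the same parsing function as in the question):

import tensorflow as tf

files = tf.data.Dataset.list_files("images/*.tfrecord")
dataset = (files
    .shuffle(10000)  # shuffling file names is cheap: no image data sits in the buffer
    .apply(tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=8))
    .map(parse_func, num_parallel_calls=8)
    .shuffle(256)    # small residual shuffle of decoded examples
    .batch(64)
    .prefetch(1))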

splitting tfrecords dataset with multiple features

I have an image classification task where I've created multiple crops of each image, as well as flipped/flopped versions, to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
    lbl: int,
    crop_0: np.ndarray,
    crop_1: np.ndarray,
    crop_0_flipped: np.ndarray,
    crop_1_flipped: np.ndarray
}
Basically 4 images / entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N images becomes 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each image, only use the crop_0 and crop_1 images and average the softmax outputs for classification.
My question is: what is the best and most efficient way of training on such a dataset? I'm willing to change my approach if the current one would make training inefficient. It seems the simplest thing would have been to write separate tfrecords files for each version (crop & flip/flop) and interleave the files into one dataset, but I don't want a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach that you'll come to loathe later (I did it that way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk. Do that kind of preprocessing on the CPU while you train. The TensorFlow Dataset preprocessing pipeline makes this easy and modular, and provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to TensorFlow using feed_dict; instead, TensorFlow will simply invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization you'll use the Dataset.map function to apply a set of transformations, using the num_parallel_calls argument to distribute the operations across multiple cores. If your preprocessing can be done in TensorFlow code, great; if not, you'll need to use tf.py_func to call Python preprocessing code.
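As a rough sketch (TF 1.x style), the train pipeline with map-based augmentation could look like the following; parse_fn, the record schema, the file names and the crop size are illustrative assumptions, not something taken from your question:

import tensorflow as tf

def parse_fn(serialized):
    # Decode one record into (image, label); the schema here is illustrative
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'lbl': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [256, 256, 3])
    return tf.cast(image, tf.float32), features['lbl']

def augment_fn(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.random_crop(image, [224, 224, 3])  # crop size is illustrative
    return image, label

train_ds = (tf.data.TFRecordDataset("train.tfrecords")
    .map(parse_fn, num_parallel_calls=8)     # decode in parallel
    .map(augment_fn, num_parallel_calls=8)   # augment in parallel, on the fly
    .shuffle(1000)
    .batch(32)
    .prefetch(1))

test_ds = (tf.data.TFRecordDataset("test.tfrecords")
    .map(parse_fn, num_parallel_calls=8)     # no augmentation in the test pipeline
    .batch(32)
    .prefetch(1))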
The guide linked above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This will allow you to get a string handle from each of the 2 datasets (train and test) and pass that string to TensorFlow via feed_dict to indicate which of the two datasets it should pull samples from.
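A minimal sketch of that feedable iterator, assuming the train_ds and test_ds objects built as above:

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_ds.output_types, train_ds.output_shapes)
images, labels = iterator.get_next()

train_iterator = train_ds.make_one_shot_iterator()
test_iterator = test_ds.make_initializable_iterator()

with tf.Session() as sess:
    train_handle = sess.run(train_iterator.string_handle())
    test_handle = sess.run(test_iterator.string_handle())
    # Which dataset feeds the graph is decided purely by the handle you pass
    sess.run([images, labels], feed_dict={handle: train_handle})  # training batch
    sess.run(test_iterator.initializer)
    sess.run([images, labels], feed_dict={handle: test_handle})   # evaluation batch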
