Optimizing shuffle buffer size in tensorflow dataset api - python

I'm trying to use the Dataset API to load data and find that I'm spending a majority of the time loading data into the shuffle buffer. How might I optimize this pipeline to minimize the time spent populating the shuffle buffer?
(tf.data.Dataset.list_files(path)
    .shuffle(num_files)  # number of tfrecord files
    .apply(tf.contrib.data.parallel_interleave(
        lambda f: tf.data.TFRecordDataset(f), cycle_length=num_files))
    .shuffle(num_items)  # number of images in the dataset
    .map(parse_func, num_parallel_calls=8)
    .map(get_patches, num_parallel_calls=8)
    .apply(tf.contrib.data.unbatch())
    # patch_buffer is currently the number of patches extracted per image
    .apply(tf.contrib.data.shuffle_and_repeat(patch_buffer))
    .batch(64)
    .prefetch(1)
    .make_one_shot_iterator())

Since I have at most thousands of images, my solution to this problem was to have a separate tfrecord file per image. That way individual images could be shuffled without having to load them into memory first. This drastically reduced the buffering that needed to occur.
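For reference, a minimal sketch of how that layout can be wired up, written against the newer tf.data API (TF 2.4+). parse_func and get_patches are the helpers from the question; the glob pattern and the num_files / patch_buffer values are placeholders:

import tensorflow as tf

num_files = 5000    # placeholder: total number of per-image TFRecord files
patch_buffer = 128  # placeholder: patches extracted per image, as in the question

dataset = (
    tf.data.Dataset.list_files("records/*.tfrecord")  # one TFRecord file per image
    .shuffle(num_files)                    # shuffling file names is cheap: no image data in memory
    .interleave(tf.data.TFRecordDataset,
                cycle_length=16,
                num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_func, num_parallel_calls=tf.data.AUTOTUNE)
    .map(get_patches, num_parallel_calls=tf.data.AUTOTUNE)
    .unbatch()                             # one element per patch
    .shuffle(patch_buffer)                 # small buffer: only a few images' worth of patches
    .repeat()
    .batch(64)
    .prefetch(1)
)

Because each file holds a single image, shuffling the file list already randomizes the image order, so the large item-level shuffle buffer is no longer needed.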

Related

TFRecords in a custom DataGenerator

I am building an LSTM model in which samples have different numbers of time steps. I want to optimise my code and performance, so I do not want to use masking; instead, I want to write a generator that automatically groups batches by the number of steps. My idea is the following:
For each possible sequence length (ranging from 1 to 365), create a TFRecord dataset which holds only the samples with that length.
In each generator loop, randomly choose a sequence length and take a batch of data from the corresponding TFRecord dataset. One option is to read batches from this TFRecord dataset until it is depleted - this is preferable if it is costly to open and close a TFRecord dataset multiple times.
Otherwise, if it is not costly to open and close a TFRecord dataset and read from the middle, we can randomly choose the sequence length for each batch (this sounds more robust; see the sketch below).
I was able to implement this logic with .csvs (i.e. one csv per fixed sequence length) following this example, https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly, but I am wondering whether I could gain more performance by doing this with TFRecords. However, I couldn't find any resources that would teach me how to use them with this degree of flexibility.
Can anybody point me in the right direction here?
Thanks!
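One way the per-batch option could look in TF 2.x, kept as a hedged sketch: the seq_XXX.tfrecord naming and the parse_example helper are hypothetical, and one live iterator is kept per sequence length so each file is opened only once.

import numpy as np
import tensorflow as tf

# Hypothetical layout: one TFRecord file per sequence length, e.g. "seq_005.tfrecord".
# parse_example is a hypothetical helper that decodes one serialized example into
# (features, label) tensors for that sequence length.
def make_bucket_iterator(length, batch_size=32):
    ds = tf.data.TFRecordDataset("seq_{:03d}.tfrecord".format(length))
    ds = ds.map(parse_example).shuffle(1024).repeat().batch(batch_size)
    return iter(ds)  # the file is opened once and kept open

lengths = list(range(1, 366))
buckets = {length: make_bucket_iterator(length) for length in lengths}

def batch_generator():
    while True:
        length = int(np.random.choice(lengths))  # randomly pick a sequence length
        yield next(buckets[length])              # every sample in this batch has that length

Each batch then contains samples of a single length, so no masking is needed; the trade-off is holding one open iterator (and its shuffle buffer) per length.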

Resizing images in data preprocessing for training a convolutional network

I am trying to load data from jpeg files to train a convolutional network. The images are large, however, at 24 million pixels each, so loading and using them at full resolution is not practical.
To get the images into a more useful format, I am trying to load each image, rescale it, and then append it to a list. Once this is done, I can convert the list into a numpy array and feed it into the network for training as usual.
My problem is that my data set is very large and it takes about a second to rescale each image, which makes resizing every image the way I have currently implemented it infeasible:
import numpy as np
import matplotlib.pyplot as plt
from skimage.measure import block_reduce

trainX = []
length_training_DF = 30000
for i in range(length_training_DF):
    im = plt.imread(TRAIN_IM_DIR + trainDF.iloc[i]['image_name'] + '.jpg')
    image = block_reduce(im, block_size=(10, 10, 1), func=np.max)
    trainX.append(image)
I have also used the following:
from keras.preprocessing import image

length_training_DF = 30000
trainX = []
for i in range(length_training_DF):
    img = image.load_img(TRAIN_IM_DIR + trainDF.iloc[i]['image_name'] + '.jpg',
                         target_size=(224, 224))
    trainX.append(image.img_to_array(img))  # convert the PIL image to an array
Is there any way to load these images more quickly into a format suitable for training a network? I have thought about using a Keras dataset, perhaps via tf.keras.preprocessing.image_dataset_from_directory(), but the directory in which the image data is stored is not organised into one folder per target class, as this method requires.
The images are for a binary classification problem.
The usual way would be to write a preprocessing script that loads the large images, rescales them, applies other operations if needed, and then saves each class to a separate directory, as required by ImageDataGenerator.
There are at least three good reasons to do that:
Typically, you will run your training process dozens of times. You don't want to redo the rescaling (or, e.g., auto white balance) every time.
ImageDataGenerator provides vital methods for augmenting your training data set.
It's a good generator out of the box. You most likely don't want to load the entire data set into memory.
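A hedged sketch of that one-off preprocessing step, assuming trainDF has a 'target' column holding the binary label (the column name is a guess); TRAIN_IM_DIR and 'image_name' are taken from the question:

import os
from PIL import Image

OUT_DIR = 'preprocessed'  # hypothetical output root, one sub-folder per class
for _, row in trainDF.iterrows():
    label = str(row['target'])  # assumed binary label column
    os.makedirs(os.path.join(OUT_DIR, label), exist_ok=True)
    img = Image.open(os.path.join(TRAIN_IM_DIR, row['image_name'] + '.jpg'))
    img = img.resize((224, 224))  # rescale once, not on every training run
    img.save(os.path.join(OUT_DIR, label, row['image_name'] + '.jpg'))

# Training then streams (and augments) the small images from disk:
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True)
train_gen = datagen.flow_from_directory(OUT_DIR, target_size=(224, 224),
                                        batch_size=32, class_mode='binary')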

More memory- and speed-efficient way to read in and save images?

I am training a neural network. To do so, I read in 182,335 images (PNG files) with the code below.
import glob
import cv2

folders = glob.glob(r'path\to\images\*')

imagenames_list = []
for folder in folders:
    for f in glob.glob(folder + '/*.png'):
        imagenames_list.append(f)

read_images = []
for image in imagenames_list:
    read_images.append(cv2.imread(image))
After some preprocessing of the data I created a pandas dataframe and saved it as a pickle-file:
df.to_pickle(r'data\data_as_pddataframe.pkl')
df.head()
Because of the huge number of images, I end up with a relatively big pickle file (3 GB). Because of this, it takes some time to read the file in, and it also needs a lot of memory. Furthermore, when I train the network in Google Colab, Colab sometimes crashes because of the huge amount of data.
Therefore, is there a more efficient way 1. to read in the data and 2. to store the dataframe?
Thanks!
I would do something like this:
Make sure that the batch size of your model is small enough that the input data and model parameters fit in memory.
Save the images as images on disk. Save the non-image data as a Parquet, CSV, or whatever (don't use Pickle for this). Put the image filenames in the table.
Keep data on disk, don't load it all into memory.
Load your non-image data as a regular data frame, and only load images from disk when they are needed for the current SGD batch.
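A minimal sketch of that pattern with a Keras Sequence; the Parquet path and the 'filename' / 'label' column names are hypothetical, and it assumes the images on disk already share one size:

import math
import cv2
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_parquet('data/metadata.parquet')  # hypothetical table: filenames + labels

class ImageBatches(tf.keras.utils.Sequence):
    def __init__(self, frame, batch_size=32):
        self.frame = frame
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.frame) / self.batch_size)

    def __getitem__(self, idx):
        rows = self.frame.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = np.stack([cv2.imread(path) for path in rows['filename']])
        labels = rows['label'].to_numpy()
        return images, labels

# model.fit(ImageBatches(df), epochs=10)  # only one batch of images is in memory at a time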

Appending large image dataset to an array

I am doing a classification task with a CNN on fake images. My data contains 100K+ images from two classes. I'm using Google Colab for this work. I already increased the RAM to 25 GB, but Colab keeps crashing while I append the images to the array; the most I can append is 16K images. Would it be better to do it in smaller groups and then combine them and average the results to get the accuracy, etc.?
Is there any advice/solution that I can try for this?
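One alternative worth trying, as a hedged sketch: instead of appending everything into one in-memory array, stream batches from disk. The 'fake_images/' layout (one sub-folder per class) and the sizes are hypothetical; on older TF 2.x releases the loader lives at tf.keras.preprocessing.image_dataset_from_directory.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    'fake_images/',          # hypothetical root with one sub-folder per class
    labels='inferred',
    label_mode='binary',
    image_size=(128, 128),
    batch_size=32,
    shuffle=True,
)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)  # only a batch or two is held in RAM
# model.fit(train_ds, epochs=10)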

Tensorflow data pipeline: Slow with caching to disk - how to improve evaluation performance?

I've built a data pipeline. Pseudo code is as follows:
dataset ->                                             # 1. source dataset
dataset = augment(dataset)                             # 2.
dataset = dataset.batch(35).prefetch(1)                # 3.
dataset = set_from_generator(to_feed_dict(dataset))    # 4. expensive op (pretrained model)
dataset = Cache('/tmp', dataset)                       # 5. cache to disk
dataset = dataset.unbatch()                            # 6.
dataset = dataset.shuffle(64).batch(256).prefetch(1)   # 7.
to_feed_dict(dataset)                                  # 8.
Steps 1 to 5 are required to generate the pretrained model outputs. I cache them because they do not change across epochs (the pretrained model weights are not updated). Steps 5 to 8 prepare the dataset for training.
Different batch sizes have to be used, as the pretrained model's inputs have a much higher dimensionality than its outputs.
The first epoch is slow, as it has to evaluate the pretrained model on every input item to generate the templates and save them to disk. Later epochs are faster, yet they are still quite slow; I suspect the bottleneck is reading the disk cache.
What could be improved in this data pipeline to reduce the issue?
Thank you!
prefetch(1) means that only one element will be prefetched; I think you may want to make it as big as the batch size or larger.
After the first cache you could try adding a second cache() without providing a path, so that part of the data is cached in memory.
Maybe your HDD is just slow? ;)
Another idea: you could manually write the results of steps 1-4 to a compressed TFRecord and then read it back with another dataset. A compressed file costs less disk I/O but more CPU.
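A hedged sketch of that last idea; precomputed_outputs, serialize_example and parse_example are hypothetical stand-ins for the pipeline's own data and (de)serialisation helpers:

import tensorflow as tf

# Write the pretrained-model outputs (steps 1-4) once, GZIP-compressed.
options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter('/tmp/pretrained_outputs.tfrecord', options) as writer:
    for item in precomputed_outputs:           # hypothetical: results of steps 1-4
        writer.write(serialize_example(item))  # hypothetical: returns serialized bytes

# Later epochs only read the compressed file back.
dataset = (
    tf.data.TFRecordDataset('/tmp/pretrained_outputs.tfrecord', compression_type='GZIP')
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # hypothetical parser
    .shuffle(64)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # prefetching hides part of the read/decompression cost
)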
