TensorFlow Benchmarks Input Pipeline: Parallelize Image Processing?

I'm running the TF benchmarks and have read the High-Performance Models document, and I have a question.
The document says:
Parallelize I/O Reads
data_flow_ops.RecordInput is used to parallelize reading from disk. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed into its own large internal pool and when it has loaded at least half of its capacity, it produces output tensors.
This op has its own internal threads that are dominated by I/O time that consume minimal CPU, which allows it to run smoothly in parallel with the rest of the model.
Parallelize Image Processing
After images are read from RecordInput they are passed as tensors to the image processing pipeline. To make the image processing pipeline easier to explain, assume that the input pipeline is targeting 8 GPUs with a batch size of 256 (32 per GPU).
256 records are read and processed individually in parallel. This starts with 256 independent RecordInput read ops in the graph. Each read op is followed by an identical set of ops for image preprocessing that are considered independent and executed in parallel. The image preprocessing ops include operations such as image decoding, distortion, and resizing.
But then I read the code of preprocessing.py:
record_input = data_flow_ops.RecordInput(
    file_pattern=dataset.tf_record_pattern(subset),
    seed=301,
    parallelism=64,
    buffer_size=10000,
    batch_size=self.batch_size,
    shift_ratio=shift_ratio,
    name='record_input')
records = record_input.get_yield_op()
records = tf.split(records, self.batch_size, 0)
records = [tf.reshape(record, []) for record in records]
for idx in xrange(self.batch_size):
    value = records[idx]
    (label, image) = self.parse_and_preprocess(value, idx)
    split_index = idx % self.num_splits
    labels[split_index].append(label)
    images[split_index].append(image)
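For reference, here is a rough sketch of the kind of per-record ops I understand parse_and_preprocess to build (decoding, distortion, resizing), based on the document's description; the feature keys and the distortion step are illustrative, not the benchmark's exact code:

import tensorflow as tf  # TF 1.x

def parse_and_preprocess_sketch(serialized_example):
    # Illustrative feature keys; the benchmark's actual keys may differ.
    features = tf.parse_single_example(
        serialized_example,
        features={'image/encoded': tf.FixedLenFeature([], tf.string),
                  'image/class/label': tf.FixedLenFeature([], tf.int64)})
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)  # decoding
    image = tf.image.random_flip_left_right(image)                       # distortion
    image = tf.image.resize_images(image, [224, 224])                    # resizing
    label = tf.cast(features['image/class/label'], tf.int32)
    return label, image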
It seems that reading from disk happens in parallel thanks to the data_flow_ops.RecordInput op, but the image-processing ops built in the for loop look serialized.
Am I wrong, or is the document wrong?
If I am wrong, how do the image-processing ops execute in parallel?
Thanks very much!

Related

Using Ray + LightGBM + limited memory

So, I would like to train a LightGBM model on a large, remote Ray cluster with a large dataset. Before that, I would like to write the code such that I can also run the training in a memory-constrained setting, e.g. my local laptop, where the dataset does not fit in memory. That will require some way of lazily loading the data.
The way I imagine it, it should be possible with Ray to load batches of random samples of the large dataset from disk (multiple .pq files) and feed them to the LightGBM training function. The memory would thereby act as a fast buffer that holds random, loaded batches, which are fed to the training function and then removed from memory. Multiple workers take care of training plus the I/O ops for loading new samples from disk into memory. The maximum amount of memory could be constrained so as not to exceed my local resources, so that my PC doesn't crash. Is this possible?
I have not yet understood whether LightGBM needs the full dataset at once, or whether it can be fed batches iteratively, as with neural networks, for instance. So far, I have tried using the lightgbm_ray library for this:
from lightgbm_ray import RayDMatrix, RayParams, train, RayFileType

# some stuff before
...

# make dataset
data_train = RayDMatrix(
    data=filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)

# feed to training function
evals_result = {}
bst = train(
    params_model,
    data_train,
    evals_result=evals_result,
    valid_sets=[data_train],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2)
)
I thought the lazy=True keyword might take care of it; however, when executing this, I see memory being maxed out and then my app crashes.
Thanks for any advice!
LightGBM requires loading the entire dataset for training, so in this case you can test on your laptop with a subset of the data (i.e. only pass a subset of the parquet filenames in).
The lazy=True flag delays the data loading so that it is split across the actors, rather than loading everything into memory first and then splitting and sending it to the actors. However, this still loads the entire dataset into memory, since all actors are on the same (local) node.
Additionally, when you do move to running on the remote cluster, these tips might be helpful for optimizing memory usage: https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost%20memro#how-to-optimize-xgboost-memory-usage.
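For example, the local test could look roughly like this, reusing filenames, TARGET, features and params_model from your snippet; the 10% sampling fraction is arbitrary:

import random
from lightgbm_ray import RayDMatrix, RayParams, RayFileType, train

# Take roughly 10% of the parquet files so the loaded data fits in laptop RAM.
subset = random.sample(filenames, max(1, len(filenames) // 10))

data_train = RayDMatrix(
    data=subset,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
)

bst = train(
    params_model,
    data_train,
    evals_result={},
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)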

RAM shortage in Google Colab when training a CNN due to image loading [duplicate]

I am designing a multi-label image classifier. For this I need to load around 7,867 training images. While loading the images, RAM usage increases from 0.92 GB to 12.5 GB.
After loading, when I fit the images into a numpy array, RAM usage reaches the total available size, i.e. 25.54 GB, and the code stops executing with the error "your session crashed".
Sample code which I am using:
train_images = []
for i in tqdm(range(train.shape[0])):
    img = image.load_img(
        '/content/Multi_Label_dataset/Images/' + train['Id'][i] + '.jpg',
        target_size=(400, 400, 3)
    )
    img = image.img_to_array(img)
    img = img / 255
    train_images.append(img)
Up to this point, RAM usage was 12.52 GB.
X = np.array(train_images)
While executing this line, RAM usage goes into the red and the "Session Crashed" message pops up.
How do I handle this?
Your dataset is too large to be loaded into RAM all at once. This is a common situation when working with image datasets. Along with the dataset, the RAM also needs to hold the model, other variables, and additional space for processing.
To help with loading, you can make use of ImageDataGenerator() and its flow_from_directory() method. These are available in Keras; have a look at the documentation.
The ImageDataGenerator() takes care of image pre-processing such as reshaping and normalizing. The flow_from_directory() method helps solve the memory issue: it dynamically loads a batch of images from the specified directory and passes them to the model after applying the pre-processing steps.
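A minimal sketch of that approach, assuming the images are reorganized into one sub-folder per class; the path, sizes, and the model object are placeholders:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255)  # normalization replaces img / 255

train_generator = datagen.flow_from_directory(
    '/content/Multi_Label_dataset/Images_by_class/',  # hypothetical layout
    target_size=(400, 400),
    batch_size=32,
    class_mode='categorical')

# Batches are read from disk on the fly, so the ~7,867 images never
# sit in RAM all at once.
model.fit(train_generator, epochs=10)  # model: your compiled Keras model (tf.keras)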

Optimizing shuffle buffer size in tensorflow dataset api

I'm trying to use the Dataset API to load data and find that I'm spending a majority of the time loading data into the shuffle buffer. How might I optimize this pipeline to minimize the time spent populating the shuffle buffer?
(tf.data.Dataset.list_files(path)
 .shuffle(num_files)  # number of tfrecord files
 .apply(tf.contrib.data.parallel_interleave(
     lambda f: tf.data.TFRecordDataset(f), cycle_length=num_files))
 .shuffle(num_items)  # number of images in the dataset
 .map(parse_func, num_parallel_calls=8)
 .map(get_patches, num_parallel_calls=8)
 .apply(tf.contrib.data.unbatch())
 # Patch buffer is currently the number of patches extracted per image
 .apply(tf.contrib.data.shuffle_and_repeat(patch_buffer))
 .batch(64)
 .prefetch(1)
 .make_one_shot_iterator())
Since I have at most thousands of images, my solution to this problem was to have a separate tfrecord file per image. That way individual images could be shuffled without having to load them into memory first. This drastically reduced the buffering that needed to occur.
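A sketch of what that per-image-file pipeline looks like, reusing your parse_func; the path and parallelism values are illustrative:

import tensorflow as tf  # TF 1.x

# One TFRecord file per image: shuffling file names is cheap, so the big
# in-memory shuffle buffer of decoded records is no longer needed.
files = tf.data.Dataset.list_files('/data/records/*.tfrecord')
dataset = (files
           .shuffle(buffer_size=10000)  # shuffles file names, not decoded images
           .apply(tf.contrib.data.parallel_interleave(
               tf.data.TFRecordDataset, cycle_length=8))
           .map(parse_func, num_parallel_calls=8)
           .batch(64)
           .prefetch(1))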

splitting tfrecords dataset with multiple features

I have an image classification task where I've created multiple crops of each image as well as flipped/flopped versions to extend my limited dataset. I have written the dataset to a tfrecords file where each record consists of (simplified here to two crops and only a flipped version):
{
lbl: int,
crop_0: np.ndarray,
crop_1: np.ndarray,
crop_0_flipped: np.ndarray,
crop_1_flipped: np.ndarray
}
Basically 4 images per entry. During training, I'd like to treat each image as separate, i.e. feed each record as 4 images with the same label, shuffled with the rest of the images in the dataset, so that N records become 4N images. During testing (using a separate but similarly structured dataset), I'd like to take each record, use only the crop_0 and crop_1 images, and average the softmax outputs for classification.
My question is: what is the best and most efficient way to train on such a dataset? I'm willing to change my approach if it makes training more efficient. The simplest thing would seem to be to have separate tfrecords files for each version (crop and flipped/flopped images) and interleave the files into one dataset, but I do not want a whole bunch of files to deal with if I can help it.
Writing the dataset to disk with 4N images is an approach you'll come to loathe later (I did it this way originally and loathe that code now). The better way is to keep your original dataset on disk as-is and not write your preprocessing steps to disk. Do that kind of preprocessing on the CPU while you train. The TensorFlow Dataset preprocessing pipeline makes this easy and modular, and it provides the parallelization you need to take advantage of multiple cores at no extra coding expense.
This is the main guide:
https://www.tensorflow.org/programmers_guide/datasets
Your approach should be to create 2 Dataset objects, one for train and one for test. The train Dataset pipeline will perform all the data augmentation you mentioned. The test Dataset pipeline will not, naturally.
One key to understanding this approach is that you will not feed the data to TensorFlow using feed_dict; instead, TensorFlow will invoke the Dataset pipeline to pull the data it needs for each batch.
To get parallelization, you'll use the Dataset.map function to apply a set of transformations, with the num_parallel_calls argument distributing the operations across multiple cores. If your preprocessing can be done in TensorFlow code, great; if not, you'll need to use tf.py_func to call Python preprocessing code.
The guide I linked to above describes all of this very well. You will want to use a feedable iterator, described in the section called "Creating an iterator". This will allow you to get a string handle from each of the 2 datasets (train and test) and pass that string to TensorFlow via feed_dict to indicate which of the two datasets it should pull samples from.
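Here is a minimal sketch of the feedable-iterator pattern in TF 1.x, with stand-in datasets in place of your real train/test pipelines:

import tensorflow as tf  # TF 1.x

# Stand-ins; in practice these would be TFRecordDataset pipelines where the
# train-side map/flat_map expands each record into 4 (image, label) pairs.
train_ds = tf.data.Dataset.range(1000).map(lambda x: x * 2).repeat().batch(32)
test_ds = tf.data.Dataset.range(1000).batch(32)

# A string handle fed through feed_dict selects which dataset to pull from.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, train_ds.output_types, train_ds.output_shapes)
next_batch = iterator.get_next()

train_iter = train_ds.make_one_shot_iterator()
test_iter = test_ds.make_initializable_iterator()

with tf.Session() as sess:
    train_handle = sess.run(train_iter.string_handle())
    test_handle = sess.run(test_iter.string_handle())
    sess.run(next_batch, feed_dict={handle: train_handle})  # training batch
    sess.run(test_iter.initializer)
    sess.run(next_batch, feed_dict={handle: test_handle})   # test batch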

reading a large dataset in tensorflow

I am not quite sure how the file queue works. I am trying to use a large dataset like ImageNet as input, so preloading the data is not an option, and I am wondering how to use the file queue. According to the tutorial, we can convert the data to a TFRecords file as input. Now we have a single big TFRecords file. When we specify a FIFO queue for the reader, does that mean the program will fetch a batch of data each time and feed it to the graph, instead of loading the whole file of data?
The amount of pre-fetching depends on your queue capacity. If you use string_input_producer for your filenames and batch for batching, you will have 2 queues: the filename queue, and the prefetching queue created by batch. The queue created by batch has a default capacity of 32, controlled by the batch(..., capacity=) argument, so it can prefetch up to 32 images. If you follow the outline in the TensorFlow official how-tos, processing examples (everything after batch) happens in the main Python thread, whereas filling up the queue happens in threads created/started by batch/start_queue_runners, so prefetching new data and running already-prefetched data through the network occur concurrently, blocking when the queue gets full or empty.
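To make that concrete, here is a minimal sketch of such a queue-based pipeline in TF 1.x; the record keys, image shape, and capacities are assumptions:

import tensorflow as tf  # TF 1.x queue-based input pipeline

filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized,
    features={'image_raw': tf.FixedLenFeature([], tf.string),
              'label': tf.FixedLenFeature([], tf.int64)})
image = tf.reshape(tf.decode_raw(features['image_raw'], tf.uint8), [224, 224, 3])
label = tf.cast(features['label'], tf.int32)

# capacity bounds how many preprocessed examples sit in the prefetch queue.
images, labels = tf.train.batch([image, label], batch_size=32,
                                num_threads=4, capacity=128)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    img_batch, lbl_batch = sess.run([images, labels])  # one prefetched batch
    coord.request_stop()
    coord.join(threads)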
