reading a large dataset in tensorflow - python

I am not quite sure about how file-queue works. I am trying to use a large dataset like imagenet as input. So preloading data is not the case, so I am wondering how to use the file-queue. According to the tutorial, we can convert data to TFRecords file as input. Now we have a single big TFRecords file. So when we specify a FIFO queue for the reader, does it mean the program would fetch a batch of data each time and feed the graph instead of loading the whole file of data?

The amount of pre-fetching depends on your queue capacity. If you use string_input_producer for your filenames and batch for batching, you will have 2 queues - filename queue, and prefetching queue created by batch. Queue created by batch has default capacity of 32, controlled by batch(...,capacity=) argument, therefore it can prefetch up to 32 images. If you follow outline in TensorFlow official howto's, processing examples (everything after batch) will happen in main Python thread, whereas filling up the queue will happen in threads created/started by batch/start_queue_runners, so prefetching new data and running prefetched data through the network will occur concurrently, blocking when the queue gets full or empty.

Related

which one to use to process data for sagemaker batch inferencing pipeline - SKlearnEstimator or SKlearnProcessor

I'm building a Sagemaker batch inferencing pipeline and get confused about the options to process features (before inferencing) between using sagemaker.sklearn.processing.SKLearnProcessor and sagemaker.sklearn.estimator.SKLearn
My understanding of these two options are:
There are docs from aws to use sagemaker.sklearn.estimator.SKLearn to do the batch transformation to process the data.
The pros of using this class and its .create_model() method is that I can incorporate the created model(to process the feature before inferencing) to sagemaker.pipeline.PipelineModel which's deployed on endpoint. so the whole pipeline is behind a single endpoint to be called when inference request input in. This detailed from:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html
I don't know the specific cons, and that's the first question (1).
However, if it's only for data processing, I can also use sagemaker.sklearn.processing.SKLearnProcessor to create Sagemaker Processing jobs to process features, then dump to s3 for model to batch inferencing.
The pros to me is that it's making more sense to me to have a job that designed for processing, but cons is that it seems like I have to write a handler to pipeline the processing and inferencing myself, unlike the sagemaker.sklearn.estimator.SKLearn.
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html
So, my next question (2) is there a way to involve SKLearnProcessor in the sagemaker.pipeline.PipelineModel? if not, the following up question (3) is that if SKLearnProcessor is not designed for using in inferencing, what's the use case of it.
The final question (4) is that from efficiency perspective, what's pros and cons using each method in a Sagemaker batch inferencing pipeline?
SageMaker Inference Pipeline is a functionality of SageMaker hosting whereby you can create a serial inference pipeline (chain of containers) on an endpoint and/or Batch Transform Job.
With regards to the link you shared, a common pattern is to use two containers where one container hosts the Scikit-learn model which will act as the pre-processing step before passing the request onto the second container which hosts the model either on an endpoint or Batch Transform Job.
The SKLearnProcessor is used to kick off a SKLearn Processing Job. You can use the SKLearnProcessor with a processing script to process your data. As such, SKLearnProcessor cannot be used in a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel).
As stated above SKLearnProcessor is designed to kick off a SageMaker Processing Job that makes use of the Scikit-learn container that can be used for data pre- or post-processing and model evaluation workloads. Kindly see this link for more information.
Are you are trying to decide whether to process your data with SKLearnProcessor (Processing Job) or make use of a PipelineModel that contains a preprocessing step in a Batch Transform Job?
If so, making the decision depends on your use case. If you were to use use a Processing Job (SKLearnProcessor) then the Job would need be to kicked off before the Batch Transform Job. Once the Processing Job has completed you can then kick of the Batch Transform Job with the output of the Processing Job as input to the Batch Transform Job.
On the other hand, if you were to use Serial Inference Pipeline (sagemaker.pipeline.PipelineModel) then you would just need to make sure that the first container preprocesses the request to make sure it is compliant with what the model expects. This option would entail the processing being done on a request(s) basis within the Batch Transform Job itself.

Tensorflow: how to manually shard a dataset

I'm using the MirroredStrategy to perform multi-gpu training and it doesn't appear to be properly sharding the data. How do you go about manually sharding data?
I know that I could use the shard method for a tf.data dataset, but for that I need access to the worker ID and I can't figure out how to get that. How do I access the worker ids?
MirroredStrategy runs on a single worker (for multiple workers there is MultiWorkerMirroredStrategy). Because it runs on only one worker, MirroredStrategy runs a single Dataset pipeline without any data sharding. At each step, MirroredStrategy requests one dataset element per worker.

Tensorflow Benchmarks Input Pipeline Parallelize Image Processing?

I'm running the TF benchmarks and have read the document High-Performance Models, I have a question:
In the document, it said
Parallelize I/O Reads
data_flow_ops.RecordInput is used to parallelize reading from disk. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed into its own large internal pool and when it has loaded at least half of its capacity, it produces output tensors.
This op has its own internal threads that are dominated by I/O time that consume minimal CPU, which allows it to run smoothly in parallel with the rest of the model.
Parallelize Image Processing
After images are read from RecordInput they are passed as tensors to the image processing pipeline. To make the image processing pipeline easier to explain, assume that the input pipeline is targeting 8 GPUs with a batch size of 256 (32 per GPU).
256 records are read and processed individually in parallel. This starts with 256 independent RecordInput read ops in the graph. Each read op is followed by an identical set of ops for image preprocessing that are considered independent and executed in parallel. The image preprocessing ops include operations such as image decoding, distortion, and resizing.
But I read the code of preprocessing.py
record_input = data_flow_ops.RecordInput(
file_pattern=dataset.tf_record_pattern(subset),
seed=301,
parallelism=64,
buffer_size=10000,
batch_size=self.batch_size,
shift_ratio=shift_ratio,
name='record_input')
records = record_input.get_yield_op()
records = tf.split(records, self.batch_size, 0)
records = [tf.reshape(record, []) for record in records]
for idx in xrange(self.batch_size):
value = records[idx]
(label, image) = self.parse_and_preprocess(value, idx)
split_index = idx % self.num_splits
labels[split_index].append(label)
images[split_index].append(image)
It seems reading from disk in parallel due to data_flow_ops.RecordInput op,
but image processing ops in the for loop is serialized.
I want to know I am wrong or document is wrong?
If I am wrong, how do image processing ops execute in parallel?
Thanks very much!

How to enable Dataset pipeline has distributed reading and consuming

It is easy to use two threads that one keeps feeding data to the queue and the other consumes data from the queue and perform the computation. Since the TensorFlow recommends Dataset as input pipeline after 1.2.0., I would like to use the Dataset and its iterator to accomplish the task above, namely:
There are two processes, one feeds and the other consumes;
The pipeline suspends either it is full or empty and it stops when computation finishes at consuming.
P.S. Why in the tutorial of Threading and Queues, TensorFlow uses thread instead of process?
Thank you in advance.
Distributed tf.contrib.data pipelines are not yet supported as of TensorFlow 1.3. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.
In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows:
q = tf.FIFOQueue(...)
# Define a dummy dataset that contains the same value repeated indefinitely.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
dataset_from_queue = dummy.map(lambda _: q.dequeue())
You can then compose other Dataset tranformations with dataset_from_queue.

TensorFlow Experiment: how to avoid loading all data in memory with input_fn?

I'm struggling with passing my (messy) code from tensorflow core to the Estimator paradigm, especially using Experiments - with learn_runner.run. But I'm actually having issues feeding data to my neural network.
What I'm trying to achieve is actually pretty close to what's done with all the examples of TensorFlow and the tf.TextLineReader, e.g. https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/customestimator/trainer/model.py#L297, though I load data not from a file on disk but with a web-service.
From my understanding (and looking at the code of tensorflow.python.estimator._train_model()) the input_fn is only called once and not at each iteration. I could easily load all my data, and then do something like:
def input_fn():
data = # all data in memory
batch = tf.train.input_producer(tf.constant(data))
return batch.dequeue_many(batch_size)
but this is not sustainable as my data won't fit in memory. I'm trying to do something like:
1. load first piece of data (say N lines)
2. consume it by batches in a queue just like the input_fn above
2'. feed this queue asynchronously with new data when it's almost empty
I know how to do it in "pure" tf, e.g. How to prefetch data using a custom python function in tensorflow or Tensorflow: custom data load + asynchronous computation but I'm finding it hard to transpose it to the Experiment paradigm as I don't have access to the session to load things by myself, nor to the graph to append operations inside.
EDIT
I managed to do it using tf.py_func(), something like:
class Reader(object):
# a Python object that can load data and have some intelligence, not related to TF, initialized with batch_sized
def read_up_to(self):
"""Reads up to batch_size elements loaded in Python"""
def input_fn():
reader = Reader() # instantiated once
return tf.py_func(reader.read_up_to, inp=[], Tout=...)
I works fine, though it's a bit slower (as expected there's a way round from C++ execution to Python that introduces about 50% delay). I'm trying to work around this by putting in a specific TensorFlow queue the Python data that's read in the reader asynchronously, so that loading could be done without passing data from Python to C++ (just as in the two links above).
I had a similar issue on which I found a fix by using a SessionRunHook. This hook (there are also others) allows you to initialize operations just after the Session is created.
tf.data.Dataset.from_generator is a dataset that calls a function of yours to generate the data one example at a time. This gives you a hook to program the generation of data however you want, such as loading in batches then yielding a single example from the batch on each call to it. This other question has an example.

Categories

Resources