It is easy to use two threads, where one keeps feeding data into a queue and the other consumes data from the queue and performs the computation. Since TensorFlow recommends the Dataset API as the input pipeline after 1.2.0, I would like to use a Dataset and its iterator to accomplish the task above, namely:
There are two processes, one feeds and the other consumes;
The pipeline suspends when it is either full or empty, and it stops when the computation on the consuming side finishes.
P.S. Why does TensorFlow use threads instead of processes in the Threading and Queues tutorial?
Thank you in advance.
Distributed tf.contrib.data pipelines are not yet supported as of TensorFlow 1.3. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.
In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows:
# A queue that the producing process/thread feeds via q.enqueue(...).
q = tf.FIFOQueue(...)
# Define a dummy dataset that contains the same value repeated indefinitely.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
# Map each dummy element to a dequeue op, so iterating the dataset pulls elements from the queue.
dataset_from_queue = dummy.map(lambda _: q.dequeue())
You can then compose other Dataset transformations with dataset_from_queue.
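For example, a minimal sketch of such a composition, assuming the TF 1.3-era tf.contrib.data API and a placeholder per-element preprocessing function parse_example:
# Compose further transformations on the queue-backed dataset.
batched = (dataset_from_queue
           .map(parse_example)   # placeholder preprocessing
           .batch(32))

iterator = batched.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    # The producing process/thread enqueues elsewhere, e.g. sess.run(q.enqueue(...)).
    batch = sess.run(next_batch)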
I'm building a SageMaker batch inferencing pipeline and am confused about the options for processing features (before inferencing): using sagemaker.sklearn.processing.SKLearnProcessor versus sagemaker.sklearn.estimator.SKLearn.
My understanding of these two options is:
There are docs from AWS on using sagemaker.sklearn.estimator.SKLearn to do the batch transformation to process the data.
The pro of using this class and its .create_model() method is that I can incorporate the created model (which processes the features before inferencing) into a sagemaker.pipeline.PipelineModel that is deployed on an endpoint, so the whole pipeline sits behind a single endpoint to be called when an inference request comes in. This is detailed in:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html
I don't know the specific cons, and that's the first question (1).
However, if it's only for data processing, I can also use sagemaker.sklearn.processing.SKLearnProcessor to create SageMaker Processing jobs that process the features and then dump them to S3 for the model to run batch inferencing on.
The pro, to me, is that it makes more sense to use a job that is designed for processing, but the con is that I seem to have to write a handler to pipeline the processing and inferencing myself, unlike with sagemaker.sklearn.estimator.SKLearn.
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html
So, my next question (2) is: is there a way to involve SKLearnProcessor in sagemaker.pipeline.PipelineModel? If not, the follow-up question (3) is: if SKLearnProcessor is not designed for use in inferencing, what is its use case?
The final question (4) is: from an efficiency perspective, what are the pros and cons of using each method in a SageMaker batch inferencing pipeline?
SageMaker Inference Pipeline is a functionality of SageMaker hosting whereby you can create a serial inference pipeline (chain of containers) on an endpoint and/or Batch Transform Job.
With regards to the link you shared, a common pattern is to use two containers where one container hosts the Scikit-learn model which will act as the pre-processing step before passing the request onto the second container which hosts the model either on an endpoint or Batch Transform Job.
The SKLearnProcessor is used to kick off a SKLearn Processing Job. You can use the SKLearnProcessor with a processing script to process your data. As such, SKLearnProcessor cannot be used in a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel).
As stated above, SKLearnProcessor is designed to kick off a SageMaker Processing Job that makes use of the Scikit-learn container, which can be used for data pre- or post-processing and model evaluation workloads. Kindly see this link for more information.
Are you trying to decide whether to process your data with SKLearnProcessor (Processing Job) or make use of a PipelineModel that contains a preprocessing step in a Batch Transform Job?
If so, the decision depends on your use case. If you were to use a Processing Job (SKLearnProcessor), then the Job would need to be kicked off before the Batch Transform Job. Once the Processing Job has completed, you can then kick off the Batch Transform Job with the output of the Processing Job as its input.
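For illustration, a minimal sketch of the Processing Job route, assuming a hypothetical preprocessing.py script and placeholder S3 paths:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="0.23-1",   # Scikit-learn container version
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Run the preprocessing script; its S3 output then becomes the input
# of the subsequent Batch Transform Job.
processor.run(
    code="preprocessing.py",      # hypothetical processing script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)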
On the other hand, if you were to use a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel), then you would just need to make sure that the first container preprocesses the request so that it is compliant with what the model expects. This option entails the processing being done on a per-request basis within the Batch Transform Job itself.
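And a minimal sketch of the Serial Inference Pipeline route, where sklearn_preprocessor_model and inference_model are placeholders (e.g. obtained from the respective estimators' .create_model() calls) and role is the same placeholder IAM role as above:
from sagemaker.pipeline import PipelineModel

# Chain the preprocessing container and the model container.
pipeline_model = PipelineModel(
    name="preprocess-then-predict",
    role=role,
    models=[sklearn_preprocessor_model, inference_model],
)

# Use the chained containers in a Batch Transform Job; preprocessing
# happens per request inside the job.
transformer = pipeline_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(data="s3://my-bucket/batch-input/", content_type="text/csv")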
I'm trying to train a model that contains two sub-models on two GPUs simultaneously, and someone told me I should take a look at multiprocessing (refer to this).
As far as I know, parallelism includes data parallelism and model parallelism; my case is more likely to use model parallelism, and we usually use pipelining alongside it to reduce the waste of transferring data between the different models. This can be implemented using the torch.distributed.pipeline.sync package.
On the other hand, due to the GIL in Python, if we really want to execute multiple threads (in my case, training two models) at the same time, one way is to use torch.multiprocessing, based on Python's multiprocessing package, which spawns new interpreters to work around the GIL's limitation.
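For illustration, here is a minimal sketch of what I have in mind with torch.multiprocessing, assuming two GPUs and with dummy nn.Linear modules standing in for my real sub-models:
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def train_submodel(gpu_id):
    # Placeholder training loop: each process trains its own sub-model on its own GPU.
    device = torch.device("cuda:%d" % gpu_id)
    model = nn.Linear(10, 1).to(device)              # dummy sub-model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 10, device=device)       # dummy minibatch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    mp.set_start_method("spawn")                     # required when using CUDA in child processes
    processes = [mp.Process(target=train_submodel, args=(i,)) for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()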
My questions are below:
What is difference between parallelism and multiprocessing?
In my case which one (or which package) should I use?
I found two tutorials in PyTorch's documentation related to parallelism
(SINGLE-MACHINE MODEL PARALLEL BEST PRACTICES and TRAINING TRANSFORMER MODELS USING PIPELINE PARALLELISM), but I wonder whether the former is actually doing "parallelism" at all; it just splits the model, and only one part of the model is computing at any one time.
Thanks to anyone who takes a look at my question.
The official way to execute multiple tf.Session() instances in parallel is to use tf.train.Server as described in Distributed TensorFlow. On the other hand, the following works for Keras and can presumably be adapted to TensorFlow without using tf.train.Server, according to Keras + Tensorflow and Multiprocessing in Python.
import multiprocessing

def _training_worker(train_params):
    import keras
    model = obtain_model(train_params)           # placeholder: builds the Keras model
    model.fit(train_params)                      # placeholder: runs training
    send_message_to_main_process(...)            # placeholder IPC

def train_new_model(train_params):
    # Each training run happens in its own process, so GPU/session state is isolated.
    training_process = multiprocessing.Process(target=_training_worker,
                                               args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)       # placeholder IPC
    training_process.join()
Is the first method faster than the second? I have code written in the second way, and due to the nature of my algorithm (AlphaZero), a single GPU is supposed to run many processes, each of which performs prediction on a tiny minibatch.
tf.train.Server is designed for distributed computation within a cluster, when there is a need to communicate between different nodes. This is especially useful when training is distributed across multiple machines or in some cases across multiple GPUs on a single machine. From the documentation:
An in-process TensorFlow server, for use in distributed training.
A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.
Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes aren't interacting with each other. This method is easier to set up, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.
tf.train.Server is thus a more universal and scalable solution. Besides, it allows you to organize complex training with non-trivial communication, e.g., async gradient updates. Whether training is faster with it greatly depends on the task; I don't think there will be a significant difference on one shared GPU.
Just for reference, here's how the code looks with the server (between-graph replication example):
import sys
import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({
    'ps': ['192.168.1.1:1111'],
    'worker': ['192.168.1.2:1111',
               '192.168.1.3:1111']
})

# parse the command line to specify the machine
job_type = sys.argv[1]        # job type: "worker" or "ps"
task_idx = int(sys.argv[2])   # index of the task in the worker or ps list as defined in the ClusterSpec

# create the TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

# the parameter server is updated by remote clients
# and will not proceed beyond this if statement.
if job_type == 'ps':
    server.join()
else:
    # workers only
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % task_idx,
            cluster=cluster)):
        # build your model here as if you were only using a single machine
        pass
    with tf.Session(server.target):
        # train your model here
        pass
Even as of Keras 1.2.2 (see the referenced merge), multiprocessing is included, but model.fit_generator() is still about 4-5x slower than model.fit() due to disk-reading speed limitations. How can this be sped up, say through additional multiprocessing?
You may want to check out the workers and max_queue_size parameters of fit_generator() in the documentation. Essentially, more workers create more threads for loading data into the queue that feeds your network. There is a chance that filling the queue might cause memory problems, though, so you might want to decrease max_queue_size to avoid this.
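For example, a minimal sketch with a placeholder generator and model (these parameter names are the Keras 2 ones; Keras 1.2.2 used nb_worker, max_q_size and pickle_safe instead):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def train_generator(batch_size=32):
    # Placeholder generator: in practice this would read batches from disk.
    while True:
        yield np.random.rand(batch_size, 20), np.random.randint(2, size=(batch_size, 1))

model = Sequential([Dense(1, activation='sigmoid', input_dim=20)])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit_generator(
    train_generator(),
    steps_per_epoch=100,
    epochs=2,
    workers=4,                  # more workers fill the queue faster
    use_multiprocessing=True,   # processes instead of threads; prefer a keras.utils.Sequence for safety
    max_queue_size=10,          # cap the prefetch queue to limit memory use
)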
I had a similar problem, where I switched to dask to load the data into memory rather than using a generator (I had been using pandas). So, depending on your data size, if possible, load the data into memory and use the fit function.
I am not quite sure how the file queue works. I am trying to use a large dataset like ImageNet as input, so preloading the data is not an option, and I am wondering how to use the file queue. According to the tutorial, we can convert the data to a TFRecords file as input. Now we have a single big TFRecords file. So when we specify a FIFO queue for the reader, does it mean the program would fetch a batch of data each time and feed the graph, instead of loading the whole file of data?
The amount of prefetching depends on your queue capacity. If you use string_input_producer for your filenames and batch for batching, you will have two queues: the filename queue, and the prefetching queue created by batch. The queue created by batch has a default capacity of 32, controlled by the batch(..., capacity=) argument, so it can prefetch up to 32 images. If you follow the outline in the official TensorFlow how-tos, processing the examples (everything after batch) will happen in the main Python thread, whereas filling up the queue will happen in the threads created/started by batch/start_queue_runners, so prefetching new data and running prefetched data through the network occur concurrently, blocking when the queue gets full or empty.
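For illustration, a minimal sketch of that two-queue pipeline with the pre-Dataset TF 1.x queue API, where the feature names and shapes in the parsing step are placeholders for whatever your TFRecords actually contain:
import tensorflow as tf

# Queue 1: the filename queue (a single big TFRecords file is fine; it is cycled through).
filename_queue = tf.train.string_input_producer(['train.tfrecords'])

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Placeholder parsing: adapt the features and shapes to your own TFRecords schema.
features = tf.parse_single_example(serialized_example, features={
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.reshape(tf.decode_raw(features['image'], tf.uint8), [224, 224, 3])
label = features['label']

# Queue 2: the prefetching queue created by batch; capacity bounds how much is prefetched.
images, labels = tf.train.batch([image, label], batch_size=32, capacity=32, num_threads=2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # Each sess.run pulls one prefetched batch; filling the queues happens in the runner threads.
    batch_images, batch_labels = sess.run([images, labels])
    coord.request_stop()
    coord.join(threads)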