I'm using MirroredStrategy to perform multi-GPU training, and it doesn't appear to be properly sharding the data. How do you go about manually sharding data?
I know that I could use the shard method of a tf.data dataset, but for that I need access to the worker ID and I can't figure out how to get that. How do I access the worker IDs?
MirroredStrategy runs on a single worker (for multiple workers there is MultiWorkerMirroredStrategy). Because it runs on only one worker, MirroredStrategy runs a single Dataset pipeline without any data sharding. At each step, MirroredStrategy requests one dataset element per replica.
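If you do move to MultiWorkerMirroredStrategy, each worker's index is available from the TF_CONFIG environment variable, and you can shard manually with Dataset.shard. A minimal sketch, assuming TF_CONFIG is set by your launcher and the dataset below is just a placeholder (newer TF versions can also auto-shard the input for you):
import json
import os
import tensorflow as tf

# tf.distribute.MultiWorkerMirroredStrategy in newer TF versions
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Each worker is launched with a TF_CONFIG describing the cluster and its own task.
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
worker_index = tf_config["task"]["index"]

# Placeholder dataset; replace with your real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(1000))
# Manual sharding: this worker keeps every num_workers-th element.
dataset = dataset.shard(num_workers, worker_index).batch(32)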
I'm building a SageMaker batch inference pipeline and am confused about the options for processing features (before inference): sagemaker.sklearn.processing.SKLearnProcessor versus sagemaker.sklearn.estimator.SKLearn.
My understanding of these two options is:
There are docs from AWS that use sagemaker.sklearn.estimator.SKLearn to do a batch transformation that processes the data.
The pro of using this class and its .create_model() method is that I can incorporate the created model (which processes the features before inference) into a sagemaker.pipeline.PipelineModel that is deployed on an endpoint, so the whole pipeline sits behind a single endpoint that is called when an inference request comes in. This is detailed in:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html
I don't know the specific cons, and that's the first question (1).
However, if it's only for data processing, I can also use sagemaker.sklearn.processing.SKLearnProcessor to create SageMaker Processing jobs that process the features and then dump them to S3 for the model to run batch inference on.
The pro, to me, is that it makes more sense to have a job designed for processing, but the con is that it seems like I have to write a handler to pipeline the processing and inference myself, unlike with sagemaker.sklearn.estimator.SKLearn.
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html
So, my next question (2) is: is there a way to involve SKLearnProcessor in the sagemaker.pipeline.PipelineModel? If not, the follow-up question (3) is: if SKLearnProcessor is not designed for use in inference, what is its use case?
The final question (4) is: from an efficiency perspective, what are the pros and cons of using each method in a SageMaker batch inference pipeline?
SageMaker Inference Pipeline is a functionality of SageMaker hosting whereby you can create a serial inference pipeline (chain of containers) on an endpoint and/or Batch Transform Job.
With regards to the link you shared, a common pattern is to use two containers where one container hosts the Scikit-learn model which will act as the pre-processing step before passing the request onto the second container which hosts the model either on an endpoint or Batch Transform Job.
The SKLearnProcessor is used to kick off a SKLearn Processing Job. You can use the SKLearnProcessor with a processing script to process your data. As such, SKLearnProcessor cannot be used in a Serial Inference Pipeline (sagemaker.pipeline.PipelineModel).
As stated above, SKLearnProcessor is designed to kick off a SageMaker Processing Job that makes use of the Scikit-learn container, and it can be used for data pre- or post-processing and model evaluation workloads. Kindly see this link for more information.
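For illustration, a minimal Processing Job might look like the sketch below (the script name, S3 paths, IAM role, and instance settings are assumptions, not taken from your pipeline):
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Assumed role, paths and instance settings for illustration only.
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",  # your feature-processing script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)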
Are you trying to decide whether to process your data with SKLearnProcessor (Processing Job) or make use of a PipelineModel that contains a preprocessing step in a Batch Transform Job?
If so, the decision depends on your use case. If you were to use a Processing Job (SKLearnProcessor), then the job would need to be kicked off before the Batch Transform Job. Once the Processing Job has completed, you can then kick off the Batch Transform Job with the output of the Processing Job as its input.
On the other hand, if you were to use Serial Inference Pipeline (sagemaker.pipeline.PipelineModel) then you would just need to make sure that the first container preprocesses the request to make sure it is compliant with what the model expects. This option would entail the processing being done on a request(s) basis within the Batch Transform Job itself.
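As a rough sketch of the PipelineModel route (the model names, the already-fitted estimators, the IAM role, and the transformer settings below are assumptions for illustration):
from sagemaker.pipeline import PipelineModel

# Assume sklearn_preprocessor and estimator are already-fitted SageMaker
# estimators (e.g. an SKLearn preprocessor and your inference model).
sklearn_model = sklearn_preprocessor.create_model()
inference_model = estimator.create_model()

pipeline_model = PipelineModel(
    name="preprocess-then-predict",                          # hypothetical name
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # assumed role
    models=[sklearn_model, inference_model],                 # containers run in this order
)

# Batch Transform: each request flows through the preprocessing container first.
transformer = pipeline_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
transformer.transform(data="s3://my-bucket/raw/", content_type="text/csv")
transformer.wait()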
The official way to execute multiple tf.Session() in parallel is to use tf.train.Server, as described in Distributed TensorFlow. On the other hand, the following works for Keras and can presumably be modified for TensorFlow without using tf.train.Server, according to Keras + Tensorflow and Multiprocessing in Python.
import multiprocessing

def _training_worker(train_params):
    # Import Keras inside the child process so TensorFlow/CUDA state is
    # initialized per process rather than inherited from the parent.
    import keras
    model = obtain_model(train_params)
    model.fit(train_params)
    send_message_to_main_process(...)

def train_new_model(train_params):
    training_process = multiprocessing.Process(target=_training_worker,
                                               args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)
    training_process.join()
Is the first method faster than the second? I have code written in the second way, and due to the nature of my algorithm (AlphaZero), a single GPU is supposed to run many processes, each of which performs prediction on a tiny minibatch.
tf.train.Server is designed for distributed computation within a cluster, when there is a need to communicate between different nodes. This is especially useful when training is distributed across multiple machines or in some cases across multiple GPUs on a single machine. From the documentation:
An in-process TensorFlow server, for use in distributed training.
A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.
Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes aren't interacting with each other. This method is easier to set up, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.
tf.train.Server is thus a more universal and scalable solution. Besides, it allows you to organize complex training with non-trivial communication, e.g., asynchronous gradient updates. Whether training is faster or not depends greatly on the task; I don't think there will be a significant difference on one shared GPU.
Just for reference, here's how the code looks with the server (between-graph replication example):
import sys
import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({
    'ps': ['192.168.1.1:1111'],
    'worker': ['192.168.1.2:1111',
               '192.168.1.3:1111']
})

# parse the command line to specify this machine
job_type = sys.argv[1]       # job type: "worker" or "ps"
task_idx = int(sys.argv[2])  # index of the job in the worker or ps list as defined in the ClusterSpec

# create the TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

# the parameter server is updated by remote clients.
# it will not proceed beyond this if statement.
if job_type == 'ps':
    server.join()
else:
    # workers only
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % task_idx,
            cluster=cluster)):
        # build your model here as if you were only using a single machine
        pass
    with tf.Session(server.target):
        # train your model here
        pass
It is easy to use two threads where one keeps feeding data to the queue and the other consumes data from the queue and performs the computation. Since TensorFlow recommends Dataset as the input pipeline as of 1.2.0, I would like to use the Dataset and its iterator to accomplish the task above, namely:
There are two processes, one feeds and the other consumes;
The pipeline suspends when it is either full or empty, and it stops when the computation on the consuming side finishes.
P.S. Why does TensorFlow use threads instead of processes in the Threading and Queues tutorial?
Thank you in advance.
Distributed tf.contrib.data pipelines are not yet supported as of TensorFlow 1.3. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.
In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows:
q = tf.FIFOQueue(...)
# Define a dummy dataset that contains the same value repeated indefinitely.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
dataset_from_queue = dummy.map(lambda _: q.dequeue())
You can then compose other Dataset transformations with dataset_from_queue.
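A more concrete sketch of that pattern, filling in the elided pieces with assumed values (the queue capacity, dtypes, element shape, and the producer thread are illustrative, not part of the original answer; TF 1.3-era API):
import threading
import numpy as np
import tensorflow as tf

# Assumed capacity, dtype and shape for illustration.
q = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[10]])
enqueue_input = tf.placeholder(tf.float32, shape=[10])
enqueue_op = q.enqueue(enqueue_input)

# Dummy dataset that repeats indefinitely; each element triggers a dequeue.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)
dataset_from_queue = dummy.map(lambda _: q.dequeue())
next_element = dataset_from_queue.make_one_shot_iterator().get_next()

def producer(sess, n_items=1000):
    # Feeding side: blocks when the queue is full.
    for _ in range(n_items):
        sess.run(enqueue_op, feed_dict={enqueue_input: np.random.rand(10)})

with tf.Session() as sess:
    t = threading.Thread(target=producer, args=(sess,))
    t.daemon = True
    t.start()
    for _ in range(1000):
        value = sess.run(next_element)  # consuming side: blocks when empty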
I've read Distributed Tensorflow Doc, and it mentions that in asynchronous training,
each replica of the graph has an independent training loop that executes without coordination.
From what I understand, if we use a parameter-server architecture with data parallelism, each worker computes gradients and updates its own weights without caring about the other workers' updates when training a neural network in a distributed way. Since all weights are shared on the parameter server (ps), I think the ps still has to coordinate (or aggregate) the weight updates from all workers in some way. I wonder how that aggregation works in asynchronous training. Or, more generally, how does asynchronous training work in distributed TensorFlow?
When you train asynchronously in Distributed TensorFlow, a particular worker does the following:
1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
3. The worker sends the gradients for each variable to the appropriate PS task, which applies them to their respective variable using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.
In asynchronous training, each update from the worker is applied concurrently, and the updates may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)
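As a small illustration of the knobs mentioned above (the learning rate and replica counts are arbitrary placeholders; TF 1.x API):
import tensorflow as tf

# Asynchronous training: each worker applies its own updates. Setting
# use_locking=True only serializes concurrent updates to a variable;
# it does not make reads or the full set of updates atomic.
async_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01, use_locking=True)

# Synchronous training: wrap the optimizer so gradients from all workers
# are aggregated before a single update is applied to the variables.
sync_opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(learning_rate=0.01),
    replicas_to_aggregate=4,   # assumed number of workers
    total_num_replicas=4,
)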
In asynchronous training there is no synchronization of weights among the workers. The weights are stored on the parameter server. Each worker loads and changes the shared weights independently of the others. This way, if one worker finishes an iteration faster than the other workers, it proceeds with the next iteration without waiting. The workers only interact with the shared parameter server and don't interact with each other.
Overall, it can (depending on the task) speed up the computation significantly. However, the results are sometimes worse than the ones obtained with the slower synchronous updates.
Looking at the example in the documentation you link to:
with tf.device("/job:ps/task:0"):
weights_1 = tf.Variable(...)
biases_1 = tf.Variable(...)
with tf.device("/job:ps/task:1"):
weights_2 = tf.Variable(...)
biases_2 = tf.Variable(...)
with tf.device("/job:worker/task:7"):
input, labels = ...
layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
# ...
train_op = ...
with tf.Session("grpc://worker7.example.com:2222") as sess:
for _ in range(10000):
sess.run(train_op)
You can see that the training is distributed on three machines which all share a copy of identical weights, but as is mentioned just below the example:
In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).
In other words, one GPU is used to calculate the forward pass and then transmits the results to the other two machines, while each of those machines calculates the backpropagation for a part of the weights and then sends the results back so they can all update their weights appropriately.
GPUs are used to speed up matrix multiplications and parallel mathematical operations, which are very intensive for both the forward pass and backpropagation. Distributed training simply means that you distribute these operations across many GPUs; the model is still synced between the machines, but now the backpropagation of different weights can be calculated in parallel, and the forward pass on a new mini-batch can be calculated while backprop on the previous mini-batch is still being computed. Distributed training does not mean that you have totally independent models and weights on each machine.
I am not quite sure how the file queue works. I am trying to use a large dataset like ImageNet as input, so preloading the data is not an option, and I am wondering how to use the file queue. According to the tutorial, we can convert the data to a TFRecords file as input. Now we have a single big TFRecords file. When we specify a FIFO queue for the reader, does it mean the program fetches a batch of data each time and feeds the graph, instead of loading the whole file of data?
The amount of prefetching depends on your queue capacity. If you use string_input_producer for your filenames and batch for batching, you will have two queues: the filename queue, and the prefetching queue created by batch. The queue created by batch has a default capacity of 32, controlled by the batch(..., capacity=) argument, so it can prefetch up to 32 images. If you follow the outline in the official TensorFlow how-tos, processing examples (everything after batch) will happen in the main Python thread, whereas filling the queue will happen in threads created/started by batch/start_queue_runners, so prefetching new data and running the prefetched data through the network will occur concurrently, blocking when the queue gets full or empty.
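For reference, a minimal sketch of such a pipeline (the file name, feature keys, image shape, iteration count, and capacities below are assumptions for illustration; TF 1.x queue-based API):
import tensorflow as tf

# Filename queue: holds the TFRecord file names.
filename_queue = tf.train.string_input_producer(["train.tfrecords"])  # assumed file name

# The reader dequeues a file name and reads serialized examples from it.
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Assumed feature keys and image shape for illustration.
features = tf.parse_single_example(serialized_example, features={
    "image_raw": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64),
})
image = tf.reshape(tf.decode_raw(features["image_raw"], tf.uint8), [224, 224, 3])
label = features["label"]

# Prefetching queue created by batch: capacity controls how many examples
# can be prefetched ahead of the training step.
images, labels = tf.train.batch([image, label], batch_size=32, capacity=32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(100):
        batch_images, batch_labels = sess.run([images, labels])
    coord.request_stop()
    coord.join(threads)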