How does asynchronous training work in distributed Tensorflow?

How does asynchronous training work in distributed Tensorflow? - python

I've read Distributed Tensorflow Doc, and it mentions that in asynchronous training,
each replica of the graph has an independent training loop that executes without coordination.
From what I understand, if we use parameter-server with data parallelism architecture, it means each worker computes gradients and updates its own weights without caring about other workers updates for distributed training Neural Network. As all weights are shared on parameter server (ps), I think ps still has to coordinate (or aggregate) weight updates from all workers in some way. I wonder how does the aggregation work in asynchronous training. Or in more general words, how does asynchronous training work in distributed Tensorflow?

When you train asynchronously in Distributed TensorFlow, a particular worker does the following:
The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.
In asynchronous training, each update from the worker is applied concurrently, and the updates may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note however that the locking here only provides mutual exclusion for two concurrent updates, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)

In asynchronous training there is no synchronization of weights among the workers. The weights are stored on the parameter server. Each worker loads and changes the shared weights independently from each other. This way if one worker finished an iteration faster than the other workers, it proceeds with the next iteration without waiting. The workers only interact with the shared parameter server and don't interact with each other.
Overall it can (depending on the task) speedup the computation significantly. However the results are sometimes worse than the ones obtained with the slower synchronous updates.

Looking at the example in the documentation you link to:
with tf.device("/job:ps/task:0"):
weights_1 = tf.Variable(...)
biases_1 = tf.Variable(...)
with tf.device("/job:ps/task:1"):
weights_2 = tf.Variable(...)
biases_2 = tf.Variable(...)
with tf.device("/job:worker/task:7"):
input, labels = ...
layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
# ...
train_op = ...
with tf.Session("grpc://worker7.example.com:2222") as sess:
for _ in range(10000):
sess.run(train_op)
You can see that the training is distributed on three machines which all share a copy of identical weights, but as is mentioned just below the example:
In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).
In other words, one gpu is used to calculate the forward pass and then transmits the results to the other two machines, while each of the other machines calculate the back propagation for a part of the weights and then send the results to the other machines so they can all update their weights appropriately.
GPUs are used to speed up matrix multiplications and parallel mathematical operations which are very intensive for both forward pass and back propagation. So distributed training simply means that you distribute these operations on many GPUs, the model is still synced between the machines, but now the back propagation of different weights can be calculated in parallel and the forward pass on a different mini-batch can be calculated at the same time as backprop from the previous mini-batch is still being calculated. Distributed training does not mean that you have totally independent models and weights on each machine.

Related

Can we create an ensemble of deep learning models without increasing the classification time?

I want to improve my ResNet model by creating an ensemble of X number of this model, taking the X best one I have trained. For what I've seen, a technique like bagging will take X time longer to classify an image, which is really not an option in my case.
Is there a way to create an ensemble without increasing the required classifying time? Note that I don't care about increasing the training time, because it only needs to be done one time, compared to the classification which could be made a very large number of time.

There is no magic pill for doing what you want. Extra computation cannot come free.
So one way this can be achieved is by using multiple worker machines to run inference in parallel.
Each model could run on a different machine using tensorflow serving.
For every new inference do the following:
Have a primary machine which takes up the job of running the inference
This primary machine, submits requests to different workers (all of which can run in parallel)
The primary machine collects results from each individual worker, and creates the final output by combining them based upon your ensemble logic.

Depends on the ensembling method; it's an active area of research I suggest you look into, but I'll provide some examples below:
Dropout: trains parts of the model at any given iterations, thus effectively training a multi-NN ensemble
Weights averaging: train X models on X different splits of data to learn different features, average the early-stopped weights (requires advanted treatment)
Lookahead optimizer: automates the above by performing the averaging during training
Parallel weak learners: run X models, but each model taking 1/X the time to process - which can be achieved by e.g. inserting a strides=X convolutional layer at input; best starting bet is at X=2, so you'll average two models' predictions at output, each prediction made in parallel (which can run faster than original single model)
If you have a multi-core CPU, however, multi-model ensembling shouldn't pose much of a problem, as per last bullet, you can run inference concurrently, so inference time shouldn't increase much
More on parallelism: if a single model is large enough, CPU parallelism will no longer help - you'll also need to ensure multiple models can fit in memory (RAM). The alternative then is again a form of downsampling to cut computation

Shared weights with model parallelism in PyTorch

Our setup involves initial part of the network (input interface) which run on separate GPU cards. Each GPU gets its own portion of data (model parallelism) and process it separately.
Each input interface, in turn, it itself a complex nn.Module. Every input interface can occupy one or several cards (say, interface_1 runs on GPU 0 and 1, interface_2 - on GPU 2 and 3 and so on).
We need to keep the weights of these input interface the same all over the training. We also need them to run in parallel to save training time which is already weeks for our scenario.
The best idea we can think of was initializing the interfaces with the same weights and then average the gradients for them. As the interfaces are identical, updating same weights with the same gradients should keep them the same all over the training process thus achieving desired “shared weights” mode.
However, I cannot find any good solution for changing values of these weights and their gradients represented as Parameter in PyTorch. Apparently, PyTorch does not allow to do so.
Our current state is: if we copy.deepcopy the ‘parameter.data’ of the “master” interface and assign it to ‘parameter.data’ of the "slave" interface, the values are indeed changed but .to(device_id) does not work and keeps them at the “master” device. However, we need them to move to a “slave” device.
Could someone please tell me if it is possible at all or, if not, if there is a better way to implement shared weights along with the parallel execution for our scenario?

Why does setting backward(retain_graph=True) use up lot GPU memory?

I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Are the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?

The source of the issue :
You are right that no matter how many times we call the backward function, the memory should not increase theorically.
Yet your issue is not because of the backpropagation, but the retain_graph variable that you have set to true when calling the backward function.
When you run your network by passing a set of input data, you call the forward function, which will create a "computation graph".
A computation graph is containing all the operations that your network has performed.
Then when you call the backward function, the computation graph saved will "basically" be runned backward to know which weight should be adjusted in which directions (what is called the gradients).
So PyTorch is saving in memory the computation graph in order to call the backward function.
After the backward function has been called and the gradients have been calculated, we free the graph from the memory, as explained in the doc https://pytorch.org/docs/stable/autograd.html :
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then usually during training we apply the gradients to the network in order to minimise the loss, then we re-run the network, and so we create a new computation graph. Yet we have only one graph in memory at the same time.
The issue :
If you set retain_graph to true when you call the backward function, you will keep in memory the computation graphs of ALL the previous runs of your network.
And since on every run of your network, you create a new computation graph, if you store them all in memory, you can and will eventually run out of memory.
On the first iteration and run of your network, you will have only one graph in memory. Yet on the 10th run of the network, you have 10 graphs in memory. And on the 10000th run you have 10000 in memory. It is not sustainable, and it is understandable why it is not recommended in the docs.
So even if it may seems that the issue is the backpropagation, it is actually the storing of the computation graphs, and since we usually call the the forward and backward function once per iteration or network run, making a confusion is understandable.
Solution :
What you need to do, is find a way to make your network and architecture work without using retain_graph. Using it will make it almost impossible to train your network, since each iteration increase the usage of your memory and decrease the speed of training, and in your case, even cause you to run out of memory.
You did not mention why you need to backpropagate multiple times, yet it is rarely needed, and i do not know of a case where it cannot be "worked around". For example, if you need to access variables or weights of previous runs you could save them inside variables and later access them, instead of trying doing a new backpropagation.
You likely need to backpropagate multiple times for another reason, yet believe as i have been in this situation, there is likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and i could help you more.
More about the backward process :
If you want to learn more about the backward process it is called the "Jacobian-vector product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, yet this ressource seems good as a starting point, as it seems less intimidating than the PyTorch documentation (in term of algebra) : https://mc.ai/how-pytorch-backward-function-works/

Why should one use tf.train.Server to execute multiple tf.Session() in parallel?

The official way to execute multiple tf.Session() in parallel is to use tf.train.Server as described in Distributed TensorFlow
. On the other hand, the following works for Keras and can be modified to Tensorflow presumably without using tf.train.Server according to Keras + Tensorflow and Multiprocessing in Python.
def _training_worker(train_params):
import keras
model = obtain_model(train_params)
model.fit(train_params)
send_message_to_main_process(...)
def train_new_model(train_params):
training_process = multiprocessing.Process(target=_training_worker, args = train_params)
training_process.start()
get_message_from_training_process(...)
training_process.join()
Is the first method faster than the second method? I have a code written in the second way, and due to the nature of my algorithm (AlphaZero) a single GPU is supposed to run many processes, each of which performs prediction of tiny minibatch.

tf.train.Server is designed for distributed computation within a cluster, when there is a need to communicate between different nodes. This is especially useful when training is distributed across multiple machines or in some cases across multiple GPUs on a single machine. From the documentation:
An in-process TensorFlow server, for use in distributed training.
A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.
Spawning multiple processes with multiprocessing.Process isn't a cluster in Tensorflow sense, because the child processes aren't interacting with each other. This method is easier to setup, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.
tf.train.Server is thus a more universal and scalable solution. Besides, it allows to organize complex training with some non-trivial communications, e.g., async gradient updates. Whether it is faster to train or not greatly depends on a task, I don't think there will be a significant difference on one shared GPU.
Just for the reference, here's how the code looks like with the server (between graph replication example):
# specify the cluster's architecture
cluster = tf.train.ClusterSpec({
'ps': ['192.168.1.1:1111'],
'worker': ['192.168.1.2:1111',
'192.168.1.3:1111']
})
# parse command-line to specify machine
job_type = sys.argv[1] # job type: "worker" or "ps"
task_idx = sys.argv[2] # index job in the worker or ps list as defined in the ClusterSpec
# create TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)
# parameter server is updated by remote clients.
# will not proceed beyond this if statement.
if job_type == 'ps':
server.join()
else:
# workers only
with tf.device(tf.train.replica_device_setter(worker_device='/job:worker/task:' + task_idx,
cluster=cluster)):
# build your model here as if you only were using a single machine
pass
with tf.Session(server.target):
# train your model here
pass

Modify Tensorflow Code to place preprocessing on CPU and training on GPU

I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated for training. To try to understand how one would actually implement this within an experiment (ie. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153 inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped with tf.device('/cpu:0') around these two lines would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - There is also a generate_input_fn and parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well or would that basically be forced by having the inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow up, I would like to understand how this can be further adapted in a distributed ML engine experiment - would any of the recommendations above need to change if there were say 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training so that each worker will be independently iterating through the data (and passing gradients asynchronously back to the PS) which suggests to me that no further modifications from the single GPU above would be needed if you train in this way. However, this seems a bit to easy to be true.

MAIN QUESTION:
The 2 codes your placed actually are 2 different parts of the training, Line 282/294 in my options is so called "pre-processing" part, for it's parse raw input data into Tensors, this operations not suitable for GPU accelerating, so it will be sufficient if allocated on CPU.
Line 152/152 is part of the training model for it's processing the raw feature into different type of features.
'cpu:0' means the operations of this section will be allocated on CPU, but not bind to specified core. The operations allocated on CPU will run in multi-threads and use multi-cores.
If your running machine has GPUs, the TensorFlow will prefer allocating the operations on GPUs if the device is not specified.

The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.