Shared weights with model parallelism in PyTorch

Our setup involves an initial part of the network (the input interfaces) that runs on separate GPU cards. Each GPU gets its own portion of the data (model parallelism) and processes it separately.
Each input interface is itself a complex nn.Module. Every input interface can occupy one or several cards (say, interface_1 runs on GPUs 0 and 1, interface_2 on GPUs 2 and 3, and so on).
We need to keep the weights of these input interfaces identical throughout training. We also need them to run in parallel to save training time, which is already weeks for our scenario.
The best idea we could think of was to initialize the interfaces with the same weights and then average their gradients. As the interfaces are identical, updating the same weights with the same gradients should keep them the same throughout training, thus achieving the desired “shared weights” mode.
However, I cannot find a good way to change the values of these weights and their gradients, which are represented as Parameter objects in PyTorch. Apparently, PyTorch does not allow this.
Our current state is: if we copy.deepcopy the ‘parameter.data’ of the “master” interface and assign it to the ‘parameter.data’ of the “slave” interface, the values do change, but .to(device_id) does not work and the data stays on the “master” device. However, we need it to move to the “slave” device.
Could someone please tell me whether this is possible at all or, if not, whether there is a better way to implement shared weights along with parallel execution for our scenario?
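The idea described in the question (identical initial weights plus averaged gradients) can be sketched as follows. This is only an illustration under assumptions, not a confirmed solution to the question: the helper names make_replicas and average_gradients are hypothetical, the small nn.Sequential stands in for a real input interface, and each replica is assumed to fit on a single device (an interface spanning several GPUs would need per-submodule placement instead of one .to(device) call). The device list is set to CPU so the sketch runs anywhere.

```python
import copy

import torch
import torch.nn as nn


def make_replicas(master, devices):
    """Deep-copy the whole module once per device so every replica starts identical."""
    return [copy.deepcopy(master).to(device) for device in devices]


def average_gradients(replicas):
    """Replace each replica's gradients with the mean over all replicas."""
    for params in zip(*(replica.parameters() for replica in replicas)):
        grads = [p.grad for p in params]
        if any(g is None for g in grads):
            continue  # this parameter took no part in the backward pass
        target = grads[0].device
        mean = torch.stack([g.to(target) for g in grads]).mean(dim=0)
        for p in params:
            p.grad.copy_(mean)  # copy_ also handles the cross-device transfer


# Hypothetical stand-in for one "input interface".
# On real hardware the list could be e.g. ["cuda:0", "cuda:2"].
devices = ["cpu", "cpu"]
torch.manual_seed(0)
master = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
replicas = make_replicas(master, devices)
optimizers = [torch.optim.SGD(r.parameters(), lr=0.1) for r in replicas]

for step in range(3):
    for replica, device in zip(replicas, devices):
        batch = torch.randn(16, 4, device=device)  # each replica sees its own data
        replica(batch).pow(2).mean().backward()
    average_gradients(replicas)
    for optimizer in optimizers:
        optimizer.step()
        optimizer.zero_grad()

# The replicas should still hold the same weights after several updates.
in_sync = all(
    torch.allclose(p0.cpu(), p1.cpu())
    for p0, p1 in zip(replicas[0].parameters(), replicas[1].parameters())
)
print("replicas in sync:", in_sync)
```

Copying the whole module and then moving it, rather than assigning individual parameter.data tensors, sidesteps the device problem described in the question, because .to(device) moves every parameter of the copy.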

Related

Why does setting backward(retain_graph=True) use up a lot of GPU memory?

I need to backpropagate through my neural network multiple times, so I set backward(retain_graph=True).
However, this is causing
RuntimeError: CUDA out of memory
I don't understand why this is.
Are the number of variables or weights doubling? Shouldn't the amount of memory used remain the same regardless of how many times backward() is called?

The source of the issue:
You are right that, in theory, memory use should not increase no matter how many times we call the backward function.
Yet your issue is not caused by backpropagation itself, but by the retain_graph argument that you set to True when calling the backward function.
When you run your network on a set of input data, you call the forward function, which creates a "computation graph".
A computation graph contains all the operations that your network has performed.
Then, when you call the backward function, the saved computation graph is essentially run backward to determine which weights should be adjusted and in which direction (these are the gradients).
So PyTorch keeps the computation graph in memory so that the backward function can be called.
After the backward function has been called and the gradients have been calculated, the graph is freed from memory, as explained in the docs at https://pytorch.org/docs/stable/autograd.html:
retain_graph (bool, optional) – If False, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.
Then, during training, we usually apply the gradients to the network to minimise the loss and re-run the network, which creates a new computation graph. Still, only one graph is in memory at any one time.
The issue:
If you set retain_graph to True when you call the backward function, you keep in memory the computation graphs of ALL the previous runs of your network.
Since every run of your network creates a new computation graph, storing them all means you can and eventually will run out of memory.
On the first iteration and run of your network, you will have only one graph in memory. Yet on the 10th run of the network, you have 10 graphs in memory, and on the 10000th run you have 10000. This is not sustainable, and it is understandable why the docs do not recommend it.
So even if the issue may seem to be backpropagation, it is actually the storing of the computation graphs. Since we usually call the forward and backward functions once per iteration or network run, the confusion is understandable.
Solution:
What you need to do is find a way to make your network and architecture work without retain_graph. Using it will make it almost impossible to train your network, since each iteration increases your memory usage and decreases the training speed and, in your case, even causes you to run out of memory.
You did not mention why you need to backpropagate multiple times, but it is rarely needed, and I do not know of a case where it cannot be "worked around". For example, if you need to access variables or weights from previous runs, you could save them in variables and access them later, instead of running a new backpropagation.
You may need to backpropagate multiple times for another reason, but believe me, as I have been in this situation: there is likely a way to accomplish what you are trying to do without storing the previous computation graphs.
If you want to share why you need to backpropagate multiple times, maybe others and I could help you more.
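One common workaround can be sketched like this, assuming the reason for several backward calls is that several losses share one forward pass (the question does not say so, so this is only a guess at the use case). Summing the losses and calling backward once gives the same accumulated gradient as backpropagating each loss separately with retain_graph=True. The function names and the tiny tanh model are hypothetical.

```python
import torch


def grads_with_retain_graph(w, x):
    """Two backward passes over one graph; the first must retain the graph."""
    h = torch.tanh(w * x)  # shared forward computation
    loss_a, loss_b = h.sum(), (h ** 2).sum()
    loss_a.backward(retain_graph=True)  # keeps the saved intermediate tensors
    loss_b.backward()                   # last pass, so the graph can be freed
    grad = w.grad.clone()
    w.grad = None                       # reset so later calls start from zero
    return grad


def grads_with_single_backward(w, x):
    """One backward pass over the summed loss; no retain_graph needed."""
    h = torch.tanh(w * x)
    loss_a, loss_b = h.sum(), (h ** 2).sum()
    (loss_a + loss_b).backward()
    grad = w.grad.clone()
    w.grad = None
    return grad


torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
x = torch.randn(5)

g_retain = grads_with_retain_graph(w, x)
g_single = grads_with_single_backward(w, x)
print("same gradients:", torch.allclose(g_retain, g_single))
```

Because gradients accumulate in w.grad, the two separate backward calls add up to the gradient of the summed loss, which is why the single call is equivalent here.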
More about the backward process:
If you want to learn more, the operation behind the backward process is called the "vector-Jacobian product". It is a bit complex and is handled by PyTorch. I do not yet fully understand it, but this resource seems like a good starting point, as it seems less intimidating than the PyTorch documentation (in terms of algebra): https://mc.ai/how-pytorch-backward-function-works/
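As a small worked example of what backward computes: for a non-scalar output y, calling y.backward(v) multiplies the vector v with the Jacobian of y with respect to x and accumulates the result in x.grad. The function y = x ** 2 and the values below are chosen only for illustration; its Jacobian is diagonal, so the product can be checked by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                       # Jacobian J = diag(2 * x)
v = torch.tensor([1.0, 0.1, 0.01])

y.backward(v)                    # accumulates v^T J into x.grad

expected = v * 2 * x.detach()    # v^T diag(2x), worked out by hand
matches = torch.allclose(x.grad, expected)
print("x.grad matches v^T J:", matches)
```

For a scalar loss, loss.backward() is the special case where v is 1.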

How to attach a tensor to a particular point in the computation graph in PyTorch?

As stated in the title, I need to attach a tensor to a particular point in the computation graph in PyTorch.
What I'm trying to do is this:
While getting outputs from all mini-batches, accumulate them in a list and, when one epoch finishes, calculate the mean. Then I need to calculate the loss from the mean, so backpropagation must take all these operations into account.
I am able to do that (without detaching and storing) when there is not much training data. However, this is not possible when the dataset gets bigger. If I don't detach the output tensors each time, I run out of GPU memory, and if I detach them, I lose track of the output tensors in the computation graph. This looks impossible no matter how many GPUs I have, since PyTorch only uses the first 4 for storing output tensors if I don't detach them before saving them into a list, even if I assign more than 4 GPUs.
Any help is really appreciated.
Thanks.

I want to use Tensorflow on CPU for everything except back propagation

I recently built my first TensorFlow model (converted from hand-coded python). I'm using tensorflow-gpu, but I only want to use GPU for backprop during training. For everything else I want to use CPU. I've seen this article showing how to force CPU use on a system that will use GPU by default. However, you have to specify every single operation where you want to force CPU use. Instead I'd like to do the opposite. I'd like to default to CPU use, but then specify GPU just for the backprop that I do during training. Is there a way to do that?
Update
Looks like things are just going to run slower over tensorflow because of how my model and scenario are built at present. I tried using a different environment that just uses regular (non-gpu) tensorflow, and it still runs significantly slower than hand-coded python. The reason for this, I suspect, is it's a reinforcement learning model that plays checkers (see below) and makes one single forward prop "prediction" at a time as it plays against a computer opponent. At the time I designed the architecture, that made sense. But it's not very efficient to do predictions one at a time, and less so with whatever overhead there is for tensorflow.
So, now I'm thinking that I'm going to need to change the game playing architecture to play, say, a thousand games simultaneously and run a thousand forward prop moves in a batch. But, man, changing the architecture now is going to be tricky at best.

TensorFlow lets you control device placement with the tf.device context manager.
So, for example, to run some code on the CPU, do
with tf.device('cpu:0'):
    <your code goes here>
The same approach works to force GPU usage.
Instead of always running your forward pass on the CPU, though, you're better off making two graphs: a forward-only, CPU-only graph to be used when rolling out the policy, and a GPU-only forward-and-backward graph to be used when training.
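The split between a CPU rollout path and a GPU training path can be sketched roughly as below. This assumes TensorFlow 2 in eager mode, which is probably not what the question used, so read "graph" here as "function". The one-layer linear model and the names predict and train_step are hypothetical. The training device falls back to the CPU when no GPU is visible.

```python
import tensorflow as tf

# Use the first GPU for training if one is visible, otherwise fall back to the CPU.
train_device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

w = tf.Variable([[0.5], [-0.5]])  # hypothetical 2-input, 1-output linear policy


def predict(state):
    """Forward-only rollout step, pinned to the CPU."""
    with tf.device("/CPU:0"):
        return tf.matmul(state, w)


def train_step(states, targets, lr=0.1):
    """Forward and backward pass on the training device."""
    with tf.device(train_device):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean((tf.matmul(states, w) - targets) ** 2)
        grad = tape.gradient(loss, w)
        w.assign_sub(lr * grad)  # plain gradient-descent update
    return loss


states = tf.constant([[1.0, 2.0]])
targets = tf.constant([[1.0]])

move_value = predict(states)            # single prediction while playing
loss_before = train_step(states, targets)
loss_after = train_step(states, targets)
print(move_value.device, float(loss_before), float(loss_after))
```

Both functions share the same variable w, so the rollout path always sees the latest trained weights. With two separate TF1-style graphs, the weights would have to be copied between them explicitly.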

Modify Tensorflow Code to place preprocessing on CPU and training on GPU

I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.
The article suggests placing with tf.device('/cpu:0') around the preprocessing operations. However, when I look at the custom estimator the 'preprocessing' appears to be done in multiple steps:
Line 152/153 inputs = tf.feature_column.input_layer(features, transformed_columns) & label_values = tf.constant(LABELS) -- if I wrapped with tf.device('/cpu:0') around these two lines would that be sufficient to cover the 'preprocessing' in this example?
Line 282/294 - There is also a generate_input_fn and parse_csv function that are used to set up input data queues. Would it be necessary to place with tf.device('/cpu:0') within these functions as well or would that basically be forced by having the inputs & label_values already wrapped?
Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?
Some additional questions that aren't addressed in the post:
What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?
Distributed ML Engine Experiment
As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment: would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker iterates through the data independently (passing gradients asynchronously back to the PS), which suggests to me that no further modifications to the single-GPU setup above would be needed if you train this way. However, this seems a bit too easy to be true.

MAIN QUESTION:
The two pieces of code you pointed to are actually two different parts of training. In my opinion, Line 282/294 is the so-called "pre-processing" part, since it parses raw input data into Tensors. These operations are not suitable for GPU acceleration, so it is sufficient to place them on the CPU.
Line 152/153 is part of the training model, since it processes the raw features into different types of features.
'cpu:0' means the operations in this section will be placed on the CPU, but not bound to a specific core. Operations placed on the CPU run multi-threaded and use multiple cores.
If your machine has GPUs, TensorFlow will prefer placing operations on the GPU when no device is specified.

The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.
The first thing to note is that, whenever possible, you should prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude higher than the bandwidth over the network.
That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.

PyBrain neuron manipulation

Is there a good way to add/remove a neuron and its associated connections into/from a fully connected PyBrain network? Say I start with:
from pybrain.tools.shortcuts import buildNetwork
net = buildNetwork(2,3,1)
How would I go about making it a (2,4,1) or a (2,2,1) network WHILE maintaining all the old weights (and initializing any new ones to be random as is done when initializing the network)? The reason I want to do this is because I am attempting to use an evolutionary learning strategy to determine the best architecture and the 'mutation' step involves adding/removing nodes with some probability. (The input and output modules should always remain the same.)
edit: I found NeuronDecomposableNetwork which should make this easier, but it still seems that I have to keep track of neurons and connections separately.

I assume you're doing something along the lines of the NEAT algorithm?
There are two different answers to your question:
Open ended evolution of the network topology: in this case, I recommend encapsulating every neuron in its own "layer"/module, and add/remove them and their connections to the network iteratively, a bit like in this tutorial, except that there will be many more (single-neuron) layers. Don't forget to call the sortModules() method after each topological change.
Finding the best topology within a predefined framework (say a maximum of 1000 neurons). In that case it's easier and more efficient to build the full network in the beginning, and just mask some of the connections (e.g. using the MaskedParameters module). Among others, memetic algorithms (used like this) are designed to search such topology spaces.
An alternative, as you say, is manually managing all the weights (by tracking what is where, or using NeuronDecomposableNetwork) but I don't recommend that.
A general comment: for more advanced uses of PyBrain such as yours, relying on the 'buildNetwork' shortcut is really too limiting, and you will want to use the Network/Module/Connection API directly.
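To make concrete what "maintaining all the old weights" involves when a hidden neuron is added or removed, here is a framework-independent sketch in plain NumPy. It does not use the PyBrain API: the dictionary layout and the function names are hypothetical, and biases are left out for brevity (buildNetwork adds a bias unit by default).

```python
import numpy as np

rng = np.random.default_rng(0)


def init_net(n_in, n_hidden, n_out):
    """Weights of a one-hidden-layer net: input->hidden and hidden->output."""
    return {
        "w_in": rng.normal(size=(n_hidden, n_in)),
        "w_out": rng.normal(size=(n_out, n_hidden)),
    }


def add_hidden_neuron(net):
    """Append one hidden neuron with random weights; old weights are untouched."""
    n_in, n_out = net["w_in"].shape[1], net["w_out"].shape[0]
    return {
        "w_in": np.vstack([net["w_in"], rng.normal(size=(1, n_in))]),
        "w_out": np.hstack([net["w_out"], rng.normal(size=(n_out, 1))]),
    }


def remove_hidden_neuron(net, index):
    """Drop hidden neuron `index` with its incoming and outgoing weights."""
    return {
        "w_in": np.delete(net["w_in"], index, axis=0),
        "w_out": np.delete(net["w_out"], index, axis=1),
    }


net = init_net(2, 3, 1)                        # like a (2, 3, 1) network
bigger = add_hidden_neuron(net)                # becomes (2, 4, 1)
smaller = remove_hidden_neuron(net, index=1)   # becomes (2, 2, 1)
print(bigger["w_in"].shape, smaller["w_in"].shape)
```

Each hidden neuron owns one row of the input weights and one column of the output weights, which is the bookkeeping that the single-neuron-layer and masking approaches above let the library handle for you.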
