I am trying to implement the distributed synchronous SGD approach described in this paper using Tensorflow. For that I need to compute and apply gradients layer-wise. In principle I can do it in the following way (note: incomplete code):
# WORKER CODE
opt = tf.train.GradientDescentOptimizer(learning_rate)
for layer_vars in all_layer_vars:
    grads_vars = opt.compute_gradients(loss, layer_vars)
    grads = sess.run([grad_var[0] for grad_var in grads_vars], feed_dict)
    send_grads_to_master(zip(grads, layer_vars))
# MASTER CODE
while True:
    grads_vars = receive_grads_from_worker()
    sess.run(opt.apply_gradients(grads_vars))
What I wonder is whether in this scenario (with several compute_gradients() calls, each inside a different session.run()) the number of internal operations performed by Tensorflow is the same as, or higher than, in the "standard" scenario where all grads are computed with just one invocation of compute_gradients().
That is, thinking of the backpropagation algorithm, I wonder whether in this distributed scenario Tensorflow will compute the different "deltas" only once, or not. If the latter, is there a more efficient way of doing what I want?
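For comparison, here is a minimal sketch of the "standard" single-invocation scenario referred to above, reusing the placeholder names from the incomplete code (loss, learning_rate, feed_dict, sess, and send_grads_to_master are all assumed to exist):

# Single compute_gradients() call over all trainable variables: one backward
# pass computes the gradient for every layer, and the results stay grouped
# per variable, so they can still be sent to the master layer by layer.
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_vars = opt.compute_gradients(loss)              # all variables at once
grad_tensors = [gv[0] for gv in grads_vars]
var_list = [gv[1] for gv in grads_vars]

grads = sess.run(grad_tensors, feed_dict)             # a single session.run()
send_grads_to_master(list(zip(grads, var_list)))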
Related
I have implemented a distributed strategy to train my model on multiple GPUs.
strategy = tf.distribute.MirroredStrategy(devices=devices[:FLAGS.n_gpus])
strategy.run(fn=self.train_step, args=(model, data))
My model has now become bigger and more complex, and I had to reduce the batch size to fit it onto the GPUs.
The gradient is quite noisy now, and I want to effectively increase the batch size again by accumulating gradients.
Now my question is: is this even possible when using a mirrored strategy? I know that loss and gradients are combined across the replicas anyway, so is there a way to sum them across the replicas AND across, e.g., a loop running over the batches? I tried the straightforward thing and returned the per-replica gradients from strategy.run() to add and apply them outside, like this:
for b in batches:
    per_replica_gradients = strategy.run(fn=self.train_step, args=(model, b))
    total_gradient += per_replica_gradients

optimizer.apply_gradients(zip(total_gradient, model.trainable_variables))
but Tensorflow tells me that this is not possible and that the gradients have to be applied within strategy.run(). This also makes sense to me, but I wonder whether there is a possibility to accumulate gradients AND use a mirrored strategy?
You could use tf.distribute.ReplicaContext.all_reduce. From its documentation: "This differs from Strategy.reduce in that it is for replica context and does not copy the results to the host device. all_reduce should be typically used for reductions inside the training step such as gradients."
More details can be found in the documentation here.
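A minimal sketch of that idea, assuming TF 2.x with model, optimizer and loss_fn created under strategy.scope(); the micro-batch handling is simplified (every replica receives the same list of micro-batches here), and in practice each replica would draw its own shard from a distributed dataset:

import tensorflow as tf

@tf.function
def accumulated_train_step(x_micro, y_micro):
    """Accumulates gradients over several micro-batches, sums them across
    replicas with all_reduce, and applies them once per accumulated step."""

    def step_fn(x_micro, y_micro):
        accum = [tf.zeros_like(v) for v in model.trainable_variables]
        for x, y in zip(x_micro, y_micro):            # loop over micro-batches
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            accum = [a + g for a, g in zip(accum, grads)]
        # Sum the accumulated gradients across replicas while still inside
        # replica context (divide here if you want a mean instead of a sum).
        ctx = tf.distribute.get_replica_context()
        accum = ctx.all_reduce(tf.distribute.ReduceOp.SUM, accum)
        # The gradients are already aggregated, so tell the optimizer not to
        # all-reduce them a second time (flag name may vary across TF versions).
        optimizer.apply_gradients(
            zip(accum, model.trainable_variables),
            experimental_aggregate_gradients=False)

    strategy.run(step_fn, args=(x_micro, y_micro))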
I have a neural network Network that has a vector output. Instead of using a typical loss function, I would like to implement my own loss function that is a method in some class. This looks something like:
class whatever:
    def __init__(self, network, optimizer):
        self.network = network
        self.optimizer = optimizer

    def cost_function(self, relevant_data):
        # ...implementation of cost function with respect to output of network and relevant_data...

    def train(self, epochs, other_params):
        # ...part I'm having trouble with...
The main thing I'm concerned with is about taking gradients. Since I'm taking my own custom loss function, do I need to implement my own gradient with respect to the cost function?
Once I do the math, I realize that if the cost is J, then the gradient of J is a fairly simple function in terms of the gradient of the final layer of the Network. I.e, it looks something like: Equation link.
If I used some traditional loss function like CrossEntropy, my backward pass would look like:
objective = nn.CrossEntropyLoss()

for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = objective(output, data)
    loss.backward()
    optimizer.step()
But how do we do this in my case? My guess is something like:
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = cost_function(output, data)
    # And here is where the problem comes in
    loss.backward()
    optimizer.step()
loss.backward(), as I understand it, computes the gradients of the loss function with respect to the parameters. But can I still invoke it while using my own loss function (presumably the program doesn't know what the gradient equation is)? Do I have to implement another method/subroutine to find the gradients as well?
Which brings me to my other question: if I do want to implement gradient calculation for my loss function, I also need the gradient of the neural network parameters. How do I obtain those? Is there a function for that?
As long as all your steps starting from the input till the loss function involve differentiable operations on PyTorch's tensors, you need not do anything extra. PyTorch builds a computational graph that keeps track of each operation, its inputs, and gradients. So, calling loss.backward() on your custom loss would still propagate gradients back correctly through the graph. A Gentle Introduction to torch.autograd from the PyTorch tutorials may be a useful reference.
After the backward pass, if you need to directly access the gradients for further processing, you can do so using the .grad attribute (so t.grad for tensor t in the graph).
Finally, if you have a specific use case for finding the gradient of an arbitrary differentiable function implemented using PyTorch's tensors with respect to one of its inputs (e.g. gradient of the loss with respect to a particular weight in the network), you could use torch.autograd.grad.
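As a brief sketch of both routes, assuming network, cost_function, inputs, and data are defined roughly as in the question and cost_function returns a scalar tensor built from differentiable tensor operations:

import torch

# Route 1: loss.backward() populates the .grad attribute of every parameter
# the loss depends on, exactly as with a built-in criterion.
output = network(inputs)
loss = cost_function(output, data)
loss.backward()
first_param = next(network.parameters())
print(first_param.grad.shape)        # gradient of the loss w.r.t. this parameter

# Route 2: torch.autograd.grad returns the gradients directly instead of
# accumulating them into .grad (here w.r.t. all parameters of the network).
output = network(inputs)
loss = cost_function(output, data)
grads = torch.autograd.grad(loss, list(network.parameters()))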
Equalized learning rate is one of the special techniques in Progressive GAN, a paper by the NVIDIA team. Using this method, they state that
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialized all learnable parameters from a standard normal distribution N(0, 1). During training, at each forward pass, they scale the result by the per-layer normalization constant from He's initializer.
I reproduced the code from the pytorch GAN zoo GitHub repo:
def forward(self, x, equalized):
    # generate the He constant depending on the size of tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])              # prod is numpy.prod in the original repo
    self.weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize,
                       padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        x *= self.weight
    return x
At first, I thought the He constant would be c = sqrt(2 / fan_in), as in He's paper.
Normally fan_in > 2, so c < 1, and the paper's rule w_hat_i = w_i / c scales the weights up, which increases the gradient in backpropagation and, according to ProGAN's paper, prevents vanishing gradients.
However, the code shows w_hat_i = w_i * c, i.e. the weights are scaled down.
In summary, I can't understand why scaling the parameters down many times during training helps the learning speed be more stable.
I asked this question in some communities (e.g. Artificial Intelligence, Mathematics) and still haven't received an answer.
Please help me explain it, thank you!
There is already an explanation in the paper for the reason for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.
I believe multiplying by He's constant in every pass ensures that the range of the parameters will not be too wide at any point during backpropagation, so they will not take long to adjust. If, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take the generator long to catch up, and consequently their learning processes equalize.
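As a toy illustration (not from the paper) of the scale-invariance mentioned in the quote: in the sketch below the two parameters receive gradients that differ by a factor of 100, yet Adam applies updates of roughly the same absolute size, so the parameter with the larger dynamic range adjusts relatively more slowly.

import torch

w_small = torch.nn.Parameter(torch.tensor(1.0))
w_large = torch.nn.Parameter(torch.tensor(100.0))
opt = torch.optim.Adam([w_small, w_large], lr=1e-3)

for _ in range(10):
    opt.zero_grad()
    # loss whose gradient w.r.t. each parameter is proportional to its scale
    loss = 0.5 * (w_small ** 2 + w_large ** 2)
    loss.backward()
    opt.step()

# Both parameters have moved by roughly the same absolute amount (~10 * lr),
# so relative to its scale w_large has barely changed.
print(1.0 - w_small.item(), 100.0 - w_large.item())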
In deep learning, you typically have an objective (say, image recognition) that you wish to optimize. In my field (natural language processing), though, we've seen a rise of multitask training, for instance next sentence prediction and sentence classification in a single system.
I understand how to build the forward pass, e.g. for a classification task (obj1) and a regression task (obj2):
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(300, 200)
        self.obj1 = nn.Linear(200, 5)
        self.obj2 = nn.Linear(200, 1)

    def forward(self, inputs):
        out = self.linear(inputs)
        out_obj1 = self.obj1(out)
        out_obj2 = self.obj2(out)
        return out_obj1, out_obj2
But the question then becomes: how does one optimize this? Do you call a backward pass over both losses separately? Or do you reduce them to a single loss (e.g. sum, average)? Is there an approach that is typically used for multi-task learning?
And to follow up on that, perhaps one could even argue that the parameters of the separate layers need different optimizers. In that case, the losses would have to be dealt with separately, I presume.
It is much simpler than that: you can optimize all variables at the same time without a problem. Just compute both losses with their respective criterions and add them into a single variable:
total_loss = loss_1 + loss_2
Calling .backward() on this total loss (which is still a Tensor) works perfectly fine for both. You could also weight the losses to give more importance to one rather than the other.
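For instance, a minimal sketch of one training step with the Net from the question; inputs, y_cls and y_reg, as well as the 0.5 weight, are just placeholders:

import torch
import torch.nn as nn

net = Net()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
criterion_cls = nn.CrossEntropyLoss()   # for the 5-class head
criterion_reg = nn.MSELoss()            # for the regression head

optimizer.zero_grad()
out_obj1, out_obj2 = net(inputs)
loss_1 = criterion_cls(out_obj1, y_cls)
loss_2 = criterion_reg(out_obj2.squeeze(-1), y_reg)
total_loss = loss_1 + 0.5 * loss_2      # optional weighting of the second task
total_loss.backward()                   # both heads and the shared layer get gradients
optimizer.step()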
Check the PyTorch forums for more information.
I am learning how to use Tensorflow, and at this one particular point I am really stuck and cannot make sense of it. Imagine I have a 5-layer network and the output is represented by output. Now suppose I want to find the gradient of output with respect to layer_2. For that purpose, the code I would write in Tensorflow is something like:
gradients_i_want = tf.gradients(output, layer_2)
Theoretically, this gradient should be calculated via the chain rule. I want to ask whether Tensorflow calculates these gradients via the chain rule, or whether it just takes the derivative of output with respect to layer_2.
Tensorflow will create a graph for your model, where each node is an operation (e.g. addition, multiplication, or a combination of them). Basic ops have manually defined gradient functions, and those functions will be used when applying the chain rule while traveling backwards through the graph.
If you write your own custom op, you might need to also write the corresponding gradient function.
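For instance, a minimal sketch (TF 2.x) of attaching a gradient function to a custom operation with tf.custom_gradient; the op itself is just an illustrative example:

import tensorflow as tf

@tf.custom_gradient
def clipped_square(x):
    y = tf.square(tf.clip_by_value(x, -1.0, 1.0))

    def grad(upstream):
        # Chain rule: upstream is d(loss)/dy, and we return d(loss)/dx.
        dydx = tf.where(tf.abs(x) <= 1.0, 2.0 * x, tf.zeros_like(x))
        return upstream * dydx

    return y, grad

x = tf.Variable([0.5, 2.0])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(clipped_square(x))
print(tape.gradient(loss, x))  # [1.0, 0.0]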