mixture of experts using tensorflow [duplicate] - python

I am trying to implement a crude method based on the Mixture-of-Experts paper in tensorflow - https://arxiv.org/abs/1701.06538
There would be n models defined:
model_1:
    var_11
    var_12
    loss_1
    optimizer_1
model_2:
    var_21
    var_22
    loss_2
    optimizer_2
model_3:
    var_31
    var_32
    loss_3
    optimizer_3
At every iteration, I want to train only the model with the least loss, while keeping the other variables constant. Is it possible to place a switch to execute only one of the optimizers?
P.S.: The basis of this problem is similar to a question I asked previously. http://stackoverflow.com/questions/42073239/tf-get-collection-to-extract-variables-of-one-scope/42074009?noredirect=1#comment71359330_42074009
Since the suggestion there did not work, I am trying to approach the problem differently.
Thanks in advance!

This seems to be doable with tf.cond:
import tensorflow as tf

def make_conditional_train_op(
    should_update, optimizers, variable_lists, losses):
  """Conditionally trains variables.

  Each argument is a Python list of Tensors, and each list must have the same
  length. Variables are updated based on their optimizer only if the
  corresponding `should_update` boolean Tensor is True at a given step.

  Returns a single train op which performs the conditional updates.
  """
  assert len(optimizers) == len(variable_lists)
  assert len(variable_lists) == len(losses)
  assert len(should_update) == len(variable_lists)
  conditional_updates = []
  for model_number, (update_boolean, optimizer, variables, loss) in enumerate(
      zip(should_update, optimizers, variable_lists, losses)):
    conditional_updates.append(
        tf.cond(update_boolean,
                lambda: tf.group(
                    optimizer.minimize(loss, var_list=variables),
                    tf.Print(0, ["Model {} updating".format(model_number), loss])),
                lambda: tf.no_op()))
  return tf.group(*conditional_updates)
The basic strategy is to make sure the optimizer's variable updates are defined in the lambda of one of the cond branches, in which case there is true conditional op execution, meaning that the assignment to variables (and optimizer accumulators) only happens if that branch of the cond is triggered.
As an example, we can construct some models:
def make_model_and_optimizer():
  scalar_variable = tf.get_variable("scalar", shape=[])
  vector_variable = tf.get_variable("vector", shape=[3])
  loss = tf.reduce_sum(scalar_variable * vector_variable)
  optimizer = tf.train.AdamOptimizer(0.1)
  return optimizer, [scalar_variable, vector_variable], loss
# Construct each model
optimizers = []
variable_lists = []
losses = []
for i in range(10):
  with tf.variable_scope("model_{}".format(i)):
    optimizer, variables, loss = make_model_and_optimizer()
    optimizers.append(optimizer)
    variable_lists.append(variables)
    losses.append(loss)
Then determine a conditional update strategy, in this case only training the model with the maximum loss (just because that results in more switching; the output is rather boring if only one model ever updates):
# Determine which model should be updated (in this case, the one with the
# maximum loss)
integer_one_hot = tf.one_hot(
    tf.argmax(tf.stack(losses),
              axis=0),
    depth=len(losses))
is_max = tf.equal(
    integer_one_hot,
    tf.ones_like(integer_one_hot))
Finally, we can call the make_conditional_train_op function to create the train op, then do some training iterations:
train_op = make_conditional_train_op(
    tf.unstack(is_max), optimizers, variable_lists, losses)

# Repeatedly call the conditional train op
with tf.Session():
  tf.global_variables_initializer().run()
  for i in range(20):
    print("Iteration {}".format(i))
    train_op.run()
This prints the index of whichever model is updated, and its loss, at each iteration, confirming the conditional execution:
Iteration 0
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.7271919]
Iteration 1
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.1755948]
Iteration 2
I tensorflow/core/kernels/logging_ops.cc:79] [Model 2 updating][1.9858969]
Iteration 3
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][1.6859927]

Related

Intermediate data processing for consecutive networks and backwarding

I have encountered a tricky problem using PyTorch.
I have two networks n1 and n2 to be trained. The structure of my model is something like
class Model:
    def forward(self):
        output1 = net1(input1)  # feed forward of network 1
        input2 = function()     # preparing input for network 2, containing iterations, which depends on output1
        output2 = net2(input2)  # feed forward of network 2

    def function(self):
        # iteration using self.output1
        # takes relatively long to run (a few minutes)

    def backward(self):
        self.loss = loss1 + loss2  # losses based on output1 and output2
        self.loss.backward()
net1 was pre-trained for a few epochs to ensure meaningful output, and everything was correct and running well. Then net2 was added, and the two networks are supposed to be trained simultaneously. The observation is that loss.backward() takes too long (around 20 minutes). I assume it has something to do with the function() in the forward() method, because when I changed the iteration to a smaller size, the backward pass was much faster. However, for the complete model, I cannot decrease the iteration size.
So my question is: does the loss.backward() method somehow “re-run” the function() method in forward()? If it does, are there any suggestions to avoid this problem, e.g. changing the structure of my Model?
Many thanks for any help!
Update: the function() is something like
# idx contains the indexes of some pre-defined locations
for b in range(batch_size):
    for i in range(150):
        for j in range(i, 150):
            idx_i, idx_j = idx[i], idx[j]
            # iteration is needed here because each output depends on its neighbours
            input2[b, idx_i, idx_j] = some_criterion(output1[b, idx_i-2:idx_i+3, idx_j-2:idx_j+3])
There is a misunderstanding of how backward(self) works. What your code above accomplishes is that your backward function ends up calling itself.
Simply call loss.backward() whenever you calculate the losses in your training code; there is no need to define a backward() method for your model.
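For illustration, a minimal self-contained sketch of that pattern is below (the modules, shapes, loss functions, and the stand-in for function() are all hypothetical, not taken from the question):
import torch
import torch.nn as nn

# Hedged sketch: compute both losses in the training loop and call backward()
# once on their sum; no custom backward() method on the model is needed.
net1 = nn.Linear(8, 4)
net2 = nn.Linear(4, 2)
optimizer = torch.optim.SGD(list(net1.parameters()) + list(net2.parameters()), lr=0.01)

inputs = torch.randn(16, 8)
targets1 = torch.randn(16, 4)
targets2 = torch.randn(16, 2)

for _ in range(10):
    output1 = net1(inputs)
    input2 = output1 * 2.0                 # stands in for the question's function()
    output2 = net2(input2)
    loss = nn.functional.mse_loss(output1, targets1) + nn.functional.mse_loss(output2, targets2)

    optimizer.zero_grad()
    loss.backward()                        # one backward pass through both networks
    optimizer.step()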

Overriding Apply_gradients for custom distributed training

I'm playing with a CNN architecture involving three identical networks that I train on non-overlapping datasets and then coordinate at each iteration. Each weight is updated by averaging it with the corresponding weight in the other nets and then proportionally adding that weight's current gradient.
I'm using tensorflow 2.2.0 and keras, and I think I want to override apply_gradients to do this. My first question is: should I be overriding apply_gradients?
Secondly, I have a list of the parameters for each of the models. In apply_gradients, I have a list of gradients and the var_list that goes with it. The gradients are Tensors, and the parameters are Variables. apply_gradients needs to return an Operation. How do I take a weighted sum (an average) of the parameter variables and then perform a standard gradient-descent step, returning an Operation?
Here's my current (partly commented out) code for apply_gradients, taken out of my custom optimizer subclass:
def apply_gradients(self,
                    grads_and_vars,
                    name=None,
                    experimental_aggregate_gradients=True):
    # Formatting grads_and_vars
    grads_and_vars = _filter_grads(grads_and_vars)
    var_list = [v for (_, v) in grads_and_vars]
    with K.name_scope(self._name):
        with ops.init_scope():
            self._create_all_weights(var_list)
        if not grads_and_vars:
            return control_flow_ops.no_op()
        strategy = distribute_ctx.get_strategy()
        apply_state = self._prepare(var_list)
        # Formatting done

        # Here's the trouble spot, where I'm trying to update the vars
        # CDSGD
        grads, var_list = zip(*grads_and_vars)
        grads = list(grads)
        var_list = list(var_list)
        opsR = []
        l_r = self._get_hyper("learning_rate")
        for i in range(len(grads)):
            # base = var_list[i] * 0
            # for j in range(3):  # 3 networks; agent_id is that particular network's id (0-2)
            #     base += self.pi[j][self.agent_id] * parameters[j][i]
            #     # parameters holds all the networks' parameters
            # var_list[i] = var_list[i].assign(base)
            opt1 = training_ops.resource_apply_gradient_descent(
                var_list[i].handle, l_r, grads[i], use_locking=self._use_locking)
            opsR.append(opt1)
        return opsR
With this commented out, it trains, but there is no collaboration. I've tried a couple other things, and they all either don't run or don't train.
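For illustration only, here is a hedged sketch (not a tested answer) of the averaging-then-descend rule described in the question, written with plain tensor ops on Keras models instead of inside apply_gradients. The names models, pi (the matrix of mixing weights), and agent_id mirror the question's description and are assumptions:
import tensorflow as tf

def consensus_sgd_step(models, grads, pi, agent_id, learning_rate=0.01):
    """Updates models[agent_id] in place: mix each weight with the other
    networks' corresponding weights, then take a gradient-descent step.
    `grads` are the gradients of models[agent_id]'s loss w.r.t. its own
    variables, e.g. obtained from tf.GradientTape."""
    for i, var in enumerate(models[agent_id].trainable_variables):
        # Weighted average of the i-th parameter across all networks.
        mixed = tf.add_n([pi[j][agent_id] * models[j].trainable_variables[i]
                          for j in range(len(models))])
        # Standard gradient-descent step starting from the mixed value.
        var.assign(mixed - learning_rate * grads[i])
Whether an update like this belongs inside an overridden apply_gradients or in a custom training loop is a design choice; the sketch only shows that the mixing and the descent step can be expressed directly with assign on the variables.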

Does tensorflow provide an operation like caffe average_loss operation?

Due to GPU limitations, I want to update my weights only after every two training steps. Specifically, the network will first process the first batch of inputs and save the loss. Then it processes the next batch of inputs, averages the two losses, and updates the weights once. This is like the average_loss op in Caffe (for example, fcn-berkeley). Also, how should the batchnorm update ops be calculated?
Easy, just use tf.reduce_mean(input_tensor).
TF documentation: reduce_mean
and in your case, it will be:
loss = tf.concat([loss1,loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)
Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook in a way like
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging

class MyLoggingTensorHook(tf.train.LoggingTensorHook):

    # set every_n_iter to 2 if you want to average the last 2 losses
    def __init__(self, tensors, every_n_iter):
        super().__init__(tensors=tensors, every_n_iter=every_n_iter)
        # keep track of previous losses
        self.losses = []

    def after_run(self, run_context, run_values):
        _ = run_context
        # assuming you have a tag like 'average_loss'
        # as the name of your loss tensor
        for tag in self._tag_order:
            if 'average_loss' in tag:
                self.losses.append(run_values.results[tag])
        if self._should_trigger:
            self._log_tensors(run_values.results)
        self._iter_count += 1

    def _log_tensors(self, tensor_values):
        original = np.get_printoptions()
        np.set_printoptions(suppress=True)
        logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
        np.set_printoptions(**original)
        self.losses = []
and attach it to an estimator's train method or use a TrainSpec.
You should be able to compute the gradients of your variables normally at every step, but apply them only every N steps by conditioning on the global_step variable that defines your current iteration or step (you should have initialized this variable in your graph with something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this.
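For concreteness, a rough graph-mode sketch of that accumulate-then-apply pattern is shown below. The names (accum_vars, N) and the toy variable and loss are illustrative assumptions, not a canonical recipe:
import tensorflow as tf

N = 2  # apply the averaged gradients every N steps

# Toy variable and loss so the sketch is self-contained.
x = tf.get_variable("x", shape=[3])
loss = tf.reduce_sum(tf.square(x))

global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.GradientDescentOptimizer(0.1)

grads_and_vars = [(g, v) for g, v in optimizer.compute_gradients(loss) if g is not None]

# One non-trainable accumulator per variable.
accum_vars = [tf.Variable(tf.zeros_like(v), trainable=False) for _, v in grads_and_vars]
accumulate = tf.group(*[a.assign_add(g) for a, (g, _) in zip(accum_vars, grads_and_vars)])

def apply_and_reset():
    # Apply the averaged accumulated gradients, then clear the accumulators.
    apply_op = optimizer.apply_gradients(
        [(a / N, v) for a, (_, v) in zip(accum_vars, grads_and_vars)])
    with tf.control_dependencies([apply_op]):
        return tf.group(*[a.assign(tf.zeros_like(a)) for a in accum_vars])

with tf.control_dependencies([accumulate]):
    maybe_apply = tf.cond(tf.equal((global_step + 1) % N, 0),
                          apply_and_reset, tf.no_op)
with tf.control_dependencies([maybe_apply]):
    train_op = tf.assign_add(global_step, 1)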

Adjust number of samples in Tensorflow nce_loss while training

I'd like to adjust the sampling rate during training of my neural network to test some things and see what happens. To achieve that, my idea was to create a new loss and optimizer for every iteration using the same computation graph.
def optimize(self, negative_sampling_rate):
    return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(
        self.calc_loss(negative_sampling_rate))

def calc_loss(self, negative_sampling_rate):
    return tf.reduce_mean(tf.nn.nce_loss(
        weights=self.graph.prediction_weights,
        biases=self.graph.prediction_bias,
        labels=self.graph.labels,
        inputs=self.graph.hidden,
        num_sampled=negative_sampling_rate,
        num_classes=self.graph.prediction_weights.shape[1])
    )

def train(self, batch_inputs, batch_labels, negative_sampling_rate):
    feed_dict = {self.graph.X: batch_inputs, self.graph.labels: batch_labels}
    _, loss_val = self.session.run(
        [self.optimize(negative_sampling_rate), self.calc_loss(negative_sampling_rate)],
        feed_dict=feed_dict
    )
    return loss_val
But I'm a little bit worried about the optimizer. I've heard that optimizers have internal variables which change on every training iteration. Is that true for all optimizers or only for some, and if so, which ones are usable for this approach?
The neural network should then be trained like:
for step in range(training_steps):
    NN.train(inputs, labels, compute_sampling_rate(step))
First of all, it should be okay to change the number of samples in the nce loss without causing problems for the optimizer. The internal variables stored by some optimizers relate to the historical gradients of the trainable variables in your graph.
Secondly, if you do want to reset the state of your optimizer for some reason, the way I do it is to put the optimizer in a variable scope. Whenever I want to reset it, I run the reset_optimizer op.
reset_optimizer = tf.no_op()
with tf.variable_scope('optimizer'):
    train_op = optimizer(learning_rate).minimize(loss)
opt_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "optimizer")
if len(opt_vars):  # check if optimizer state needs resetting
    reset_optimizer = tf.variables_initializer(opt_vars)
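Hypothetical usage, assuming the graph above has been built (this part is not from the original answer):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... run train_op for a while ...
    sess.run(reset_optimizer)  # re-initializes the optimizer's variables in the "optimizer" scope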

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting to do backpropagation (i.e., updating the weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()  # reduce to a scalar before calling backward()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
learning_rate = 0.01  # example value

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables
    # (i.e. W, b); on the first iteration .grad is still None
    if W.grad is not None:
        W.grad.data.zero_()
        b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    W.data -= learning_rate * W.grad.data
    b.data -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
Although the idea can be derived from the chosen answer, I feel it is worth writing it out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit on the GPU.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, followed by loss.backward(). This is limited by GPU memory.
gradient_accumulation_steps determines the actual (effective) training batch size, for which the losses from multiple forward passes are accumulated. This is NOT limited by GPU memory.
From this example, you can see how optimizer.zero_grad() may be followed by optimizer.step(), but NOT by loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated training batches equals gradient_accumulation_steps (line 227, inside the if block at line 219).
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
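For concreteness, here is a minimal self-contained sketch of this pattern (the model, data, and accumulation_steps below are illustrative, not taken from the linked script):
import torch
import torch.nn as nn

# Hedged sketch of gradient accumulation: backward() every iteration,
# zero_grad()/step() only every accumulation_steps iterations.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accumulation_steps = 4  # plays the role of gradient_accumulation_steps
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = criterion(model(inputs), targets) / accumulation_steps  # scale so the accumulated sum is an average
    loss.backward()                       # called every iteration; gradients accumulate
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                  # update only every accumulation_steps iterations
        optimizer.zero_grad()             # then clear the accumulated gradients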
Also, someone asked about an equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.
(I am still new to these libraries, so please correct me if anything I said is wrong.)
zero_grad() restarts the loop without the gradients from the last step when you use the gradient method for decreasing the error (or loss).
If you do not use zero_grad(), the loss will increase rather than decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, one can decay the gradients, for example:
optimizer = some_pytorch_optimizer

# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learnt from the samples (inputs). When we start backpropagation we want to update the weights in order to minimize the loss of our cost function, so we clear off the previous gradients in order to obtain better weights. We keep doing this during training, and we do not do it during testing because at training time we have already obtained the weights that best fit our data. Hope this makes it clearer!
In simple terms, we need zero_grad() because when we start a training loop we do not want past gradients or past results to interfere with our current results. PyTorch collects/accumulates the gradients during backpropagation, and if the past results get mixed in they give us wrong results, so we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients every time, the past values may get added up and change the result.
So we use zero_grad() to avoid the wrong accumulated results.
