Overriding apply_gradients for custom distributed training - python

I'm playing with a CNN architecture involving three identical networks that I train on non-overlapping datasets and then coordinate at each iteration. Each weight is updated by averaging it with the corresponding weights in the other nets and then proportionally adding that weight's current gradient.
I'm using TensorFlow 2.2.0 and Keras, and I think I want to override apply_gradients to do it. My first question: should I be overriding apply_gradients at all?
Secondly, I have a list of the parameters for each of the models. Inside apply_gradients, I have a list of gradients and the var_list that goes with it. The gradients are Tensors, the parameters are Variables, and apply_gradients needs to return an Operation. How do I take a weighted sum (an average) of the parameter variables, then perform the standard gradient descent step, and return an Operation?
Here's my current (partially commented-out) apply_gradients code, taken out of my custom optimizer subclass:
def apply_gradients(self,
                    grads_and_vars,
                    name=None,
                    experimental_aggregate_gradients=True):
    # Formatting grads_and_vars
    grads_and_vars = _filter_grads(grads_and_vars)
    var_list = [v for (_, v) in grads_and_vars]
    with K.name_scope(self._name):
        with ops.init_scope():
            self._create_all_weights(var_list)
        if not grads_and_vars:
            return control_flow_ops.no_op()
        strategy = distribute_ctx.get_strategy()
        apply_state = self._prepare(var_list)
        # Formatting done
        # Here's the trouble spot, where I'm trying to update the vars
        # CDSGD
        grads, var_list = zip(*grads_and_vars)
        grads = list(grads)
        var_list = list(var_list)
        opsR = []
        l_r = self._get_hyper("learning_rate")
        for i in range(len(grads)):
            # base = var_list[i] * 0
            # for j in range(3):  # 3 networks; agent_id is that particular network's id (0-2)
            #     base += self.pi[j][self.agent_id] * parameters[j][i]
            #     # parameters holds all the networks' parameters
            # var_list[i] = var_list[i].assign(base)
            opt1 = training_ops.resource_apply_gradient_descent(
                var_list[i].handle, l_r, grads[i], use_locking=self._use_locking)
            opsR.append(opt1)
        return opsR
With the collaboration block commented out, it trains, but there is no collaboration. I've tried a couple of other things, and they all either don't run or don't train.
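For reference, the commented-out consensus step seems to be aiming at something like the following sketch. This is only an illustration, not working optimizer code: parameters, pi, and agent_id are attributes assumed to live on the optimizer (they are not part of any TensorFlow API), and the helper would replace the per-variable loop above.

import tensorflow as tf

def _apply_cdsgd(self, grads, var_list):
    # Hypothetical helper: self.parameters[j][i] is assumed to be variable i of
    # network j, self.pi the mixing weights, and self.agent_id this network's id.
    l_r = self._get_hyper("learning_rate", var_list[0].dtype)
    update_ops = []
    for i, var in enumerate(var_list):
        # Weighted average of this variable with the other networks' copies.
        consensus = tf.add_n([
            self.pi[j][self.agent_id] * self.parameters[j][i]
            for j in range(len(self.parameters))
        ])
        # Move to the consensus value, then take a gradient-descent step from it.
        update_ops.append(
            var.assign(consensus - l_r * grads[i], use_locking=self._use_locking))
    # apply_gradients is expected to return a single Operation.
    return tf.group(*update_ops)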

Related

Does TensorFlow provide an operation like Caffe's average_loss operation?

Due to GPU limitations, I want to update my weights only after every two training steps. Specifically, the network first processes the first batch of inputs and saves the loss, then processes the next batch, averages the two losses, and updates the weights once. This is like the average_loss op in Caffe (for example, fcn-berkeley). I'd also like to know how to calculate the batchnorm update ops.
Easy, just use tf.reduce_mean(input_tensor).
See the TF documentation for reduce_mean.
In your case, it will be:
loss = tf.concat([loss1,loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)
Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook, along these lines:
import logging

import numpy as np
import tensorflow as tf

class MyLoggingTensorHook(tf.train.LoggingTensorHook):

    # set every_n_iter to 2 if you want to average the last 2 losses
    def __init__(self, tensors, every_n_iter):
        super().__init__(tensors=tensors, every_n_iter=every_n_iter)
        # keep track of previous losses
        self.losses = []

    def after_run(self, run_context, run_values):
        _ = run_context
        # assuming you have a tag like 'average_loss'
        # as the name of your loss tensor
        for tag in self._tag_order:
            if 'average_loss' in tag:
                self.losses.append(run_values.results[tag])
        if self._should_trigger:
            self._log_tensors(run_values.results)
        self._iter_count += 1

    def _log_tensors(self, tensor_values):
        original = np.get_printoptions()
        np.set_printoptions(suppress=True)
        logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
        np.set_printoptions(**original)
        self.losses = []
and attach it to an Estimator's train method or use a TrainSpec, e.g. as shown below.
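A hypothetical way to attach it (the estimator and input-function names are placeholders, not from the original question):

# 'average_loss' must match the tag/name of the loss tensor being logged.
hook = MyLoggingTensorHook(tensors={'average_loss': 'average_loss'}, every_n_iter=2)
my_estimator.train(input_fn=train_input_fn, hooks=[hook])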
You should be able to compute the gradients of your variables normally at every step, but apply them only every N steps by conditioning on a global step variable that tracks your current iteration (you can create it in your graph with something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this; a sketch follows below.
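A minimal sketch of that idea (TF 1.x graph mode), assuming optimizer and loss already exist. The accumulator variables, the Python-side every-two-steps check, and num_steps are illustrative choices, not the only way to condition the update:

import tensorflow as tf

# Keep only variables that actually receive a gradient.
grads_and_vars = [(g, v) for g, v in optimizer.compute_gradients(loss)
                  if g is not None]

# One non-trainable accumulator per variable.
accum = [tf.Variable(tf.zeros(v.shape, dtype=v.dtype.base_dtype), trainable=False)
         for _, v in grads_and_vars]
accum_op = tf.group(*[a.assign_add(g) for a, (g, _) in zip(accum, grads_and_vars)])

# Apply the averaged accumulated gradients, and reset the accumulators.
apply_op = optimizer.apply_gradients(
    [(a / 2.0, v) for a, (_, v) in zip(accum, grads_and_vars)])
reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        sess.run(accum_op)           # accumulate this batch's gradients (feed inputs here)
        if (step + 1) % 2 == 0:      # every second step...
            sess.run(apply_op)       # ...update the weights once
            sess.run(reset_op)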

How is embedding matrix being trained in this code snippet?

I'm following the code of a Coursera assignment which implements an NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts as an input to the LSTM. However, it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)

    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph
    )

    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)

    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell, cell_bw=backward_cell,
        dtype=tf.float32,
        inputs=embeddings,
        sequence_length=self.lengths
    )
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)

    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.
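A tiny self-contained sketch (TF 1.x) of that point: any variable the loss depends on gets a gradient and is changed by optimizer.minimize(), including an embedding matrix that is only used through tf.nn.embedding_lookup. The shapes and the optimizer here are arbitrary.

import numpy as np
import tensorflow as tf

embedding = tf.Variable(np.random.randn(5, 3), dtype=tf.float32, name="embedding")
ids = tf.constant([0, 2, 4])
looked_up = tf.nn.embedding_lookup(embedding, ids)
loss = tf.reduce_sum(tf.square(looked_up))

# There is a path from the loss back to the embedding matrix, so a gradient exists.
grad = tf.gradients(loss, embedding)[0]
assert grad is not None
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    before = sess.run(embedding)
    sess.run(train_op)
    after = sess.run(embedding)
    print("embedding rows changed:", np.any(before != after))  # True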

mixture of experts using tensorflow [duplicate]

I am trying to implement a crude method based on the Mixture-of-Experts paper in tensorflow - https://arxiv.org/abs/1701.06538
There would be n models defined:
model_1:
var_11
var_12
loss_1
optimizer_1
model_2:
var_21
var_22
loss_2
optimizer_2
model_3:
var_31
var_32
loss_3
optimizer_3
At every iteration, I want to train only the model with the least loss while keeping the other variables constant. Is it possible to place a switch so that only one of the optimizers executes?
P.S.: The basis of this problem is similar to one I asked previously: http://stackoverflow.com/questions/42073239/tf-get-collection-to-extract-variables-of-one-scope/42074009?noredirect=1#comment71359330_42074009
Since the suggestion there did not work, I am trying to approach the problem differently.
Thanks in advance!
This seems to be doable with tf.cond:
import tensorflow as tf
def make_conditional_train_op(
        should_update, optimizers, variable_lists, losses):
    """Conditionally trains variables.

    Each argument is a Python list of Tensors, and each list must have the same
    length. Variables are updated based on their optimizer only if the
    corresponding `should_update` boolean Tensor is True at a given step.

    Returns a single train op which performs the conditional updates.
    """
    assert len(optimizers) == len(variable_lists)
    assert len(variable_lists) == len(losses)
    assert len(should_update) == len(variable_lists)
    conditional_updates = []
    for model_number, (update_boolean, optimizer, variables, loss) in enumerate(
            zip(should_update, optimizers, variable_lists, losses)):
        conditional_updates.append(
            tf.cond(update_boolean,
                    lambda: tf.group(
                        optimizer.minimize(loss, var_list=variables),
                        tf.Print(0, ["Model {} updating".format(model_number), loss])),
                    lambda: tf.no_op()))
    return tf.group(*conditional_updates)
The basic strategy is to make sure the optimizer's variable updates are defined in the lambda of one of the cond branches. That gives true conditional op execution: the assignments to the variables (and to the optimizer's accumulators) only happen if that branch of the cond is triggered.
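To isolate that principle before the full example, here is a minimal sketch (separate from the answer's code): an assign created inside a branch's lambda only runs when that branch is taken.

import tensorflow as tf

v = tf.Variable(0.0)
pred = tf.placeholder(tf.bool, shape=[])
# The assign_add is built inside the true branch, so it only executes when
# pred is True; tf.identity keeps both branches returning a plain float tensor.
maybe_update = tf.cond(pred,
                       lambda: tf.identity(v.assign_add(1.0)),
                       lambda: tf.identity(v))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(maybe_update, feed_dict={pred: False})  # no update happens
    sess.run(maybe_update, feed_dict={pred: True})   # the assign runs
    print(sess.run(v))  # 1.0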
As an example, we can construct some models:
def make_model_and_optimizer():
    scalar_variable = tf.get_variable("scalar", shape=[])
    vector_variable = tf.get_variable("vector", shape=[3])
    loss = tf.reduce_sum(scalar_variable * vector_variable)
    optimizer = tf.train.AdamOptimizer(0.1)
    return optimizer, [scalar_variable, vector_variable], loss

# Construct each model
optimizers = []
variable_lists = []
losses = []
for i in range(10):
    with tf.variable_scope("model_{}".format(i)):
        optimizer, variables, loss = make_model_and_optimizer()
        optimizers.append(optimizer)
        variable_lists.append(variables)
        losses.append(loss)
Then determine a conditional update strategy, in this case only training the model with the maximum loss (just because that results in more switching; the output is rather boring if only one model ever updates):
# Determine which model should be updated (in this case, the one with the
# maximum loss)
integer_one_hot = tf.one_hot(
    tf.argmax(tf.stack(losses),
              axis=0),
    depth=len(losses))
is_max = tf.equal(
    integer_one_hot,
    tf.ones_like(integer_one_hot))
Finally, we can call the make_conditional_train_op function to create the train op, then do some training iterations:
train_op = make_conditional_train_op(
    tf.unstack(is_max), optimizers, variable_lists, losses)

# Repeatedly call the conditional train op
with tf.Session():
    tf.global_variables_initializer().run()
    for i in range(20):
        print("Iteration {}".format(i))
        train_op.run()
This prints the index of the model being updated and its loss at each iteration, confirming the conditional execution:
Iteration 0
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.7271919]
Iteration 1
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.1755948]
Iteration 2
I tensorflow/core/kernels/logging_ops.cc:79] [Model 2 updating][1.9858969]
Iteration 3
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][1.6859927]

TensorFlow: adding regularization to LSTM

Following Tensorflow LSTM Regularization, I am trying to add a regularization term to the cost function when training the parameters of LSTM cells.
Putting aside some constants, I have:
def RegularizationCost(trainable_variables):
    cost = 0
    for v in trainable_variables:
        cost += r(tf.reduce_sum(tf.pow(r(v.name), 2)))
    return cost

...

regularization_cost = tf.placeholder(tf.float32, shape=())
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)

...

tv = tf.trainable_variables()
s = tf.Session()
r = s.run

...

while (...):
    ...
    reg_cost = RegularizationCost(tv)
    r(optimizer, feed_dict={x: x_b, y: y_b, regularization_cost: reg_cost})
The problem is that adding the regularization term hugely slows the learning process. Moreover, reg_cost visibly increases with each iteration while the term associated with pred - y pretty much stagnates, i.e. reg_cost seems not to be taken into account.
I suspect I am adding this term in a completely wrong way. I did not know how to add it to the cost function itself, so I used a workaround with a scalar tf.placeholder and "manually" calculated the regularization cost. How do I do it properly?
Compute the L2 loss only once:
tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv ])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
You might want to exclude the bias variables, since those should usually not be regularized.
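For example, a hedged variation of the snippet above that skips variables whose names contain "bias" (the name filter and the 0.001 weight are illustrative assumptions, not values from the question):

tv = [v for v in tf.trainable_variables() if 'bias' not in v.name.lower()]
regularization_cost = 0.001 * tf.reduce_sum([tf.nn.l2_loss(v) for v in tv])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost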
It slows down because your code creates new nodes in every iteration; each tf.XXX call adds new nodes to the graph. That is not how you write code with TF: first you build your whole graph, including the regularization terms, and then in the while loop you only execute it.

Implementing batch normalization with tensorflow

I am trying to implement a batch normalization layer in TensorFlow. I am having no problem running the train step of this using tf.moments to get the mean and variance.
For test time, I'd like to set up an exponential moving average to track the mean and variance. I am trying to do it like this:
def batch_normalized_linear_layer(state_below, scope_name, n_inputs, n_outputs, stddev, wd, eps=.0001):
    with tf.variable_scope(scope_name) as scope:
        weight = _variable_with_weight_decay(
            "weights", shape=[n_inputs, n_outputs],
            stddev=stddev, wd=wd
        )
        act = tf.matmul(state_below, weight)
        # get moments
        act_mean, act_variance = tf.nn.moments(act, [0])
        # get mean and variance variables
        mean = _variable_on_cpu('bn_mean', [n_outputs], tf.constant_initializer(0.0))
        variance = _variable_on_cpu('bn_variance', [n_outputs], tf.constant_initializer(1.0))
        # assign the moments
        assign_mean = mean.assign(act_mean)
        assign_variance = variance.assign(act_variance)

        act_bn = tf.mul((act - mean), tf.rsqrt(variance + eps), name=scope.name + "_bn")

        beta = _variable_on_cpu("beta", [n_outputs], tf.constant_initializer(0.0))
        gamma = _variable_on_cpu("gamma", [n_outputs], tf.constant_initializer(1.0))
        bn = tf.add(tf.mul(act_bn, gamma), beta)
        output = tf.nn.relu(bn, name=scope.name)
        _activation_summary(output)
        return output, mean, variance
Where _variable_on_cpu is defined as:
def _variable_on_cpu(name, shape, initializer):
    """Helper to create a Variable stored on CPU memory.

    Args:
        name: name of the variable
        shape: list of ints
        initializer: initializer for Variable

    Returns:
        Variable Tensor
    """
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, initializer=initializer)
    return var
I believe that I am setting
assign_mean = mean.assign(act_mean)
assign_variance = variance.assign(act_variance)
incorrectly, but I am not sure how. When I use TensorBoard to track these mean and variance variables, they just stay flat at their initialized values.
Rafal's comment gets at the core of the problem: You're not running the assign nodes. You might try using the batchnorm helper I posted in another answer - How could I use Batch Normalization in TensorFlow? - or you can force the assign to happen by adding with_dependencies, as he suggests.
The general principle is that you should only count on a node being run if data or control dependencies flow "through" it. with_dependencies ensures that before the output op is used, the specified dependencies will have completed.
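A minimal sketch of that fix inside batch_normalized_linear_layer, assuming the rest of the function stays as posted. It uses the tf.control_dependencies context manager (same effect as with_dependencies when the normalization op is created inside the block) and normalizes with the batch moments directly, which also avoids reading mean and variance right after assigning them:

# assign the moments, and force the assigns to run whenever act_bn is used
assign_mean = mean.assign(act_mean)
assign_variance = variance.assign(act_variance)
with tf.control_dependencies([assign_mean, assign_variance]):
    act_bn = tf.mul((act - act_mean), tf.rsqrt(act_variance + eps),
                    name=scope.name + "_bn")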
