Computing the weight-update-ratio in Tensorflow

Computing the weight-update-ratio in Tensorflow - python

I'm searching for a way to compute the weight-update-ratio for optimizer steps in Tensorflow. The weight-update-ratio is defined as the update-scale divided by the variable scale in each step and can be used for inspecting network training.
Ideally I want a non-intrusive way to compute it in a single session run, but couldn't accomplish quite what I was looking for. Since the update-scale and parameter scale are independent of the train step, one needs to add explicit dependencies to the graph in order to graph variable-scale before and after the update step. Unfortunately, it seems that in TF dependencies can only be defined for new nodes, which further complicates the issue.
So far, the best I've come up with is a context manager for definining the necessary operations. Its used as follows
opt = tf.train.AdamOptimizer(1e0)
grads = tf.gradients(loss, tf.trainable_variables())
grads = list(zip(grads, tf.trainable_variables()))
with compute_weight_update_ratio('wur') as wur:
train = opt.apply_gradients(grads_and_vars=grads)
# ...
with tf.Session() as sess:
sess.run(wur.ratio)
The full code of compute_weight_update_ratio can be found below. What bugs me is that in the current state the weight-update-ratio (at least norm_before) is computed with every training step, but for performance reason I'd rather prefer to do it selectively (e.g only when summaries are computed).
Any ideas on how to improve?
#contextlib.contextmanager
def compute_weight_update_ratio(name, var_scope=None):
'''Injects training to compute weight-update-ratio.
The weight-update-ratio is computed as the update scale divided
by the variable scale before the update and should be somewhere in the
range 1e-2 or 1e-3.
Params
------
name : str
Operation name
Kwargs
------
var_scope : str, optional
Name selection of variables to compute weight-update-ration for. Defaults to all. Regex supported.
'''
class WeightUpdateRatio:
def __init__(self):
self.num_train = len(tf.get_collection(tf.GraphKeys.TRAIN_OP))
self.variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=var_scope)
self.norm_before = tf.norm(self.variables, name='norm_before')
def compute_ratio(self,):
train_ops = tf.get_collection(tf.GraphKeys.TRAIN_OP)
assert len(train_ops) > self.num_train, 'Missing training op'
with tf.control_dependencies(train_ops[self.num_train:]):
self.norm_after = tf.norm(self.variables, name='norm_after')
absdiff = tf.abs(tf.subtract(self.norm_after, self.norm_before), name='absdiff')
self.ratio = tf.divide(absdiff, self.norm_before, name=name)
with tf.name_scope(name) as scope:
try:
wur = WeightUpdateRatio()
with tf.control_dependencies([wur.norm_before]):
yield wur
finally:
wur.compute_ratio()

You don't need to worry about performance too much. Tensorflow only executes the subgraph necessary to produce the output.
So, in your training loop, if wur.ratio is not called during an iteration, none of the extra nodes created to compute it will be executed.

Related

Does tensorflow provide an operation like caffe average_loss operation?

due to the limiting of gpu, I want to update my weight after every two step training. Specifically, the network will firstly calculate the fisrt batch inputs and save the loss. And then the network calculate the next batch inputs and average these two losses and will update the weights once. It likes average_loss op in caffe, for example()fcn-berkeley . and how to calculate the batchnorm update-ops.

Easy, juste use tf.reduce_mean(input_tensor)
Tf documentation reduce_mean
and in your case, it will be :
loss = tf.concat([loss1,loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)

Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook in a way like
class MyLoggingTensorHook(tf.train.LoggingTensorHook):
# set every_n_iter to if you want to average last 2 losses
def __init__(self, tensors, every_n_iter):
super().__init__(tensors=tensors, every_n_iter=every_n_iter)
# keep track of previous losses
self.losses=[]
def after_run(self, run_context, run_values):
_ = run_context
# assuming you have a tag like 'average_loss'
# as the name of your loss tensor
for tag in self._tag_order:
if 'average_loss' in tag:
self.losses.append(run_values.results[tag])
if self._should_trigger:
self._log_tensors(run_values.results)
self._iter_count += 1
def _log_tensors(self, tensor_values):
original = np.get_printoptions()
np.set_printoptions(suppress=True)
logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
np.set_printoptions(**original)
self.losses=[]
and attach it to an estimator's train method or use a TrainSpec.
You should be able to compute gradients of your variables normally in every step, but apply them in every N steps by conditioning on your global_state variable that defines your current iteration or step (you should have initialized this variable in your graph by something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this.

Compute gradients for each time step of tf.while_loop

Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0
def condition(steps, x):
return steps <= 5
def loop(steps, x_in):
weight_1 = tf.Variable(1.0)
x_out = x_in * weight_1
steps += 1
return [steps, x_out]
_, x_final = tf.while_loop(
condition,
loop,
[steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different amount of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only possibility to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in / time step only without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.

You can't ever call tf.gradients inside tf.while_loop in Tensorflow based on this and this, I found this out the hard way when I was trying to create conjugate gradient descent entirely into the Tensorflow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in a tf.dynamic_rnn, but the actual cell
implementation will be a little complex since you need to evaluate a condition dynamically at runtime.
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
Alternatively, dynamic graphs have never been Tensorflow's strong suite, so consider using other frameworks like PyTorch or you can try out eager_execution and see if that helps.

Replacing gradients of Tensorflow function with output of another function

I am using a function consisting of compound Tensorflow operations. However, instead of letting Tensorflow automatically compute its derivatives with respect to one of the inputs, I would like to replace the gradients with a different computation on the same input. Moreover, some of the calculation is shared between the forward and backward pass. For example:
def func(in1, in2):
# do something with inputs using only tf operations
shared_rep = tf.op1(tf.op2(tf.op3(in1, in2))) # same computation for both forward and gradient pass
# return output of forward computation
return tf.op4(shared_rep)
def func_grad(in1, in2):
shared_rep = tf.op1(tf.op2(tf.op3(in1, in2)))
# explicitly calculate gradients with respect to in1, with the intention of replacing the gradients computed by Tensorflow
mygrad1 = tf.op5(tf.op6(shared_rep))
return mygrad1
in1 = tf.Variable([1,2,3])
in2 = tf.Variable([2.5,0.01])
func_val = func(in1, in2)
my_grad1 = func_grad(in1, in2)
tf_grad1 = tf.gradients(func_val, in1)
with tf.Session() as sess:
# would like tf_grad1 to equal my_grad1
val, my1, tf1 = sess.run([func_val, my_grad1, tf_grad1])
tf.assert_equal(my1, tf1)
NOTE: This is similar to question How to replace or modify gradient? with one key difference: I am not interested in Tensorflow computing gradients of a different function in the backward pass; rather I would like to supply the gradients myself based on alternate tensorflow operations on the input.
I am trying to use the ideas proposed in the solution to the above question and in the following post, that is using tf.RegisterGradient and gradient_override_map to override the gradient of the identity function wrapping the forward function.
This fails because inside the registered alternate grad for identity, I have no access to the input to func_grad:
#tf.RegisterGradient("CustomGrad")
def alternate_identity_grad(op, grad):
# op.inputs[0] is the output of func(in1,in2)
# grad is of no use, because I would like to replace it with func_grad(in1,in2)
g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
out_grad = tf.identity(input, name="Identity")
EDIT After additional research, I believe this question is similar to the following question. I managed to obtain the desired solution by combining gradient_override_map with the hack suggested here.

In TensorFlow, what is tf.identity used for?

I've seen tf.identity used in a few places, such as the official CIFAR-10 tutorial and the batch-normalization implementation on stackoverflow, but I don't see why it's necessary.
What's it used for? Can anyone give a use case or two?
One proposed answer is that it can be used for transfer between the CPU and GPU. This is not clear to me. Extension to the question, based on this: loss = tower_loss(scope) is under the GPU block, which suggests to me that all operators defined in tower_loss are mapped to the GPU. Then, at the end of tower_loss, we see total_loss = tf.identity(total_loss) before it's returned. Why? What would be the flaw with not using tf.identity here?

After some stumbling I think I've noticed a single use case that fits all the examples I've seen. If there are other use cases, please elaborate with an example.
Use case:
Suppose you'd like to run an operator every time a particular Variable is evaluated. For example, say you'd like to add one to x every time the variable y is evaluated. It might seem like this will work:
x = tf.Variable(0.0)
x_plus_1 = tf.assign_add(x, 1)
with tf.control_dependencies([x_plus_1]):
y = x
init = tf.initialize_all_variables()
with tf.Session() as session:
init.run()
for i in xrange(5):
print(y.eval())
It doesn't: it'll print 0, 0, 0, 0, 0. Instead, it seems that we need to add a new node to the graph within the control_dependencies block. So we use this trick:
x = tf.Variable(0.0)
x_plus_1 = tf.assign_add(x, 1)
with tf.control_dependencies([x_plus_1]):
y = tf.identity(x)
init = tf.initialize_all_variables()
with tf.Session() as session:
init.run()
for i in xrange(5):
print(y.eval())
This works: it prints 1, 2, 3, 4, 5.
If in the CIFAR-10 tutorial we dropped tf.identity, then loss_averages_op would never run.

tf.identity is useful when you want to explicitly transport tensor between devices (like, from GPU to a CPU).
The op adds send/recv nodes to the graph, which make a copy when the devices of the input and the output are different.
A default behavior is that the send/recv nodes are added implicitly when the operation happens on a different device but you can imagine some situations (especially in a multi-threaded/distributed settings) when it might be useful to fetch the value of the variable multiple times within a single execution of the session.run. tf.identity allows for more control with regard to when the value should be read from the source device. Possibly a more appropriate name for this op would be read.
Also, please note that in the implementation of tf.Variable link, the identity op is added in the constructor, which makes sure that all the accesses to the variable copy the data from the source only once. Multiple copies can be expensive in cases when the variable lives on a GPU but it is read by multiple CPU ops (or the other way around). Users can change the behavior with multiple calls to tf.identity when desired.
EDIT: Updated answer after the question was edited.
In addition, tf.identity can be used used as a dummy node to update a reference to the tensor. This is useful with various control flow ops. In the CIFAR case we want to enforce that the ExponentialMovingAverageOp will update relevant variables before retrieving the value of the loss. This can be implemented as:
with tf.control_dependencies([loss_averages_op]):
total_loss = tf.identity(total_loss)
Here, the tf.identity doesn't do anything useful aside of marking the total_loss tensor to be ran after evaluating loss_averages_op.

In addition to the above, I simply use it when I need to assign a name to ops that do not have a name argument, just like when initializing a state in RNN's:
rnn_cell = tf.contrib.rnn.MultiRNNCell([cells])
# no name arg
initial_state = rnn_cell.zero_state(batch_size,tf.float32)
# give it a name with tf.identity()
initial_state = tf.identity(input=initial_state,name="initial_state")

I came across another use case that is not completely covered by the other answers.
def conv_layer(input_tensor, kernel_shape, output_dim, layer_name, decay=None, act=tf.nn.relu):
"""Reusable code for making a simple convolutional layer.
"""
# Adding a name scope ensures logical grouping of the layers in the graph.
with tf.name_scope(layer_name):
# This Variable will hold the state of the weights for the layer
with tf.name_scope('weights'):
weights = weight_variable(kernel_shape, decay)
variable_summaries(weights, layer_name + '/weights')
with tf.name_scope('biases'):
biases = bias_variable([output_dim])
variable_summaries(biases, layer_name + '/biases')
with tf.name_scope('convolution'):
preactivate = tf.nn.conv2d(input_tensor, weights, strides=[1, 1, 1, 1], padding='SAME')
biased = tf.nn.bias_add(preactivate, biases)
tf.histogram_summary(layer_name + '/pre_activations', biased)
activations = act(biased, 'activation')
tf.histogram_summary(layer_name + '/activations', activations)
return activations
Most of the time when constructing a convolutional layer, you just want the activations returned so you can feed those into the next layer. Sometimes, however - for example when building an auto-encoder - you want the pre-activation values.
In this situation an elegant solution is to pass tf.identity as the activation function, effectively not activating the layer.

When our input data is serialized in bytes, and we want to extract features from this dataset. We can do so in key-value format and then get a placeholder for it. Its benefits are more realised when there are multiple features and each feature has to be read in different format.
#read the entire file in this placeholder
serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
#Create a pattern in which data is to be extracted from input files
feature_configs = {'image': tf.FixedLenFeature(shape=[256], dtype=tf.float32),/
'text': tf.FixedLenFeature(shape=[128], dtype=tf.string),/
'label': tf.FixedLenFeature(shape=[128], dtype=tf.string),}
#parse the example in key: tensor dictionary
tf_example = tf.parse_example(serialized_tf_example, feature_configs)
#Create seperate placeholders operation and tensor for each feature
image = tf.identity(tf_example['image'], name='image')
text = tf.identity(tf_example['text'], name='text')
label = tf.identity(tf_example['text'], name='label')

I found another application of tf.identity in Tensorboard.
If you use tf.shuffle_batch, it returns multiple tensors at once, so you see messy picture when visualizing the graph, you can't split tensor creation pipeline from actiual input tensors: messy
But with tf.identity you can create duplicate nodes, which don't affect computation flow: nice

In distribution training, we should use tf.identity or the workers will hang at waiting for initialization of the chief worker:
vec = tf.identity(tf.nn.embedding_lookup(embedding_tbl, id)) * mask
with tf.variable_scope("BiRNN", reuse=None):
out, _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, vec, sequence_length=id_sz, dtype=tf.float32)
For details, without identity, the chief worker would treat some variables as local variables inappropriately and the other workers wait for an initialization operation that can not end

I see this kind of hack to check assert:
assertion = tf.assert_equal(tf.shape(image)[-1], 3, message="image must have 3 color channels")
with tf.control_dependencies([assertion]):
image = tf.identity(image)
Also it's used just to give a name:
image = tf.identity(image, name='my_image')

Implementing batch normalization with tensorflow

I am trying to implement a batch normalization layer in tensor-flow. I am having no problem running the train step of this using tf.moments to get the mean and variance.
For test time, I'd like to set up an exponential moving average to track the mean and variance. I am trying to do it like this:
def batch_normalized_linear_layer(state_below, scope_name, n_inputs, n_outputs, stddev, wd, eps=.0001):
with tf.variable_scope(scope_name) as scope:
weight = _variable_with_weight_decay(
"weights", shape=[n_inputs, n_outputs],
stddev=stddev, wd=wd
)
act = tf.matmul(state_below, weight)
# get moments
act_mean, act_variance = tf.nn.moments(act, [0])
# get mean and variance variables
mean = _variable_on_cpu('bn_mean', [n_outputs], tf.constant_initializer(0.0))
variance = _variable_on_cpu('bn_variance', [n_outputs], tf.constant_initializer(1.0))
# assign the moments
assign_mean = mean.assign(act_mean)
assign_variance = variance.assign(act_variance)
act_bn = tf.mul((act - mean), tf.rsqrt(variance + eps), name=scope.name+"_bn")
beta = _variable_on_cpu("beta", [n_outputs], tf.constant_initializer(0.0))
gamma = _variable_on_cpu("gamma", [n_outputs], tf.constant_initializer(1.0))
bn = tf.add(tf.mul(act_bn, gamma), beta)
output = tf.nn.relu(bn, name=scope.name)
_activation_summary(output)
return output, mean, variance
Where _variable_on_cpu is defined as:
def _variable_on_cpu(name, shape, initializer):
"""Helper to create a Variable stored on CPU memory.
Args:
name: name of the variable
shape: list of ints
initializer: initializer for Variable
Returns:
Variable Tensor
"""
with tf.device('/cpu:0'):
var = tf.get_variable(name, shape, initializer=initializer)
return var
I believe that I am setting
assign_mean = mean.assign(act_mean)
assign_variance = variance.assign(act_variance)
Incorrectly, but I am not sure how. When I use tensorboard to track these mean and variance variables, they are just flat that their initialized values.

Rafal's comment gets at the core of the problem: You're not running the assign nodes. You might try using the batchnorm helper I posted in another answer - How could I use Batch Normalization in TensorFlow? - or you can force the assign to happen by adding with_dependencies, as he suggests.
The general principle is that you should only count on a node being run if data or control dependencies flow "through" it. with_dependencies ensures that before the output op is used, the specified dependencies will have completed.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.