I am trying to implement a batch normalization layer in TensorFlow. I have no problem running the training step, using tf.moments to get the mean and variance.
For test time, I'd like to set up an exponential moving average to track the mean and variance. I am trying to do it like this:
def batch_normalized_linear_layer(state_below, scope_name, n_inputs, n_outputs, stddev, wd, eps=.0001):
    with tf.variable_scope(scope_name) as scope:
        weight = _variable_with_weight_decay(
            "weights", shape=[n_inputs, n_outputs],
            stddev=stddev, wd=wd
        )
        act = tf.matmul(state_below, weight)
        # get moments
        act_mean, act_variance = tf.nn.moments(act, [0])
        # get mean and variance variables
        mean = _variable_on_cpu('bn_mean', [n_outputs], tf.constant_initializer(0.0))
        variance = _variable_on_cpu('bn_variance', [n_outputs], tf.constant_initializer(1.0))
        # assign the moments
        assign_mean = mean.assign(act_mean)
        assign_variance = variance.assign(act_variance)

        act_bn = tf.mul((act - mean), tf.rsqrt(variance + eps), name=scope.name + "_bn")

        beta = _variable_on_cpu("beta", [n_outputs], tf.constant_initializer(0.0))
        gamma = _variable_on_cpu("gamma", [n_outputs], tf.constant_initializer(1.0))
        bn = tf.add(tf.mul(act_bn, gamma), beta)
        output = tf.nn.relu(bn, name=scope.name)
        _activation_summary(output)
        return output, mean, variance
Where _variable_on_cpu is defined as:
def _variable_on_cpu(name, shape, initializer):
    """Helper to create a Variable stored on CPU memory.

    Args:
        name: name of the variable
        shape: list of ints
        initializer: initializer for Variable

    Returns:
        Variable Tensor
    """
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, initializer=initializer)
    return var
I believe that I am setting

assign_mean = mean.assign(act_mean)
assign_variance = variance.assign(act_variance)

incorrectly, but I am not sure how. When I use TensorBoard to track these mean and variance variables, they just stay flat at their initialized values.
Rafal's comment gets at the core of the problem: You're not running the assign nodes. You might try using the batchnorm helper I posted in another answer - How could I use Batch Normalization in TensorFlow? - or you can force the assign to happen by adding with_dependencies, as he suggests.
The general principle is that you should only count on a node being run if data or control dependencies flow "through" it. with_dependencies ensures that before the output op is used, the specified dependencies will have completed.
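For illustration, here is a minimal sketch of the second option, using the public tf.control_dependencies API (TF 1.x era; assign_mean, assign_variance, and output are the names from the question's code, and control_flow_ops.with_dependencies behaves the same way):

# Force the assign ops to run whenever `output` is consumed downstream.
# tf.identity creates a new node inside the dependency scope; return this
# tensor instead of the original `output` so the dependency takes effect.
with tf.control_dependencies([assign_mean, assign_variance]):
    output = tf.identity(output)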
Related
I want to constrain the parameters of an intermediate layer in a neural network to prefer discrete values: -1, 0, or 1. The idea is to add a custom objective function that increases the loss if the parameters take any other value. Note that I want to constrain the parameters of a particular layer, not all layers.
How can I implement this in PyTorch? I want to add this custom loss to the total loss in the training loop, something like this:
custom_loss = constrain_parameters_to_be_discrete
loss = other_loss + custom_loss
Maybe using a Dirichlet prior would help; any pointers on this?
Extending upon @Shai's answer and mixing it with this answer, one can do it more simply via a custom layer into which you pass the specific layer you want to constrain.
First, the derivative of torch.abs(x**2 - torch.abs(x)), taken from WolframAlpha (check here), is placed inside the regularize function.
Now the Constrainer layer:
class Constrainer(torch.nn.Module):
    def __init__(self, module, weight_decay=1.0):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay
        # Backward hook is registered on the specified module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)
        # Not working with grad accumulation, check the original answer and the
        # pointers there if that's needed

    def _weight_decay_hook(self, *_):
        for parameter in self.module.parameters():
            parameter.grad = self.regularize(parameter)

    def regularize(self, parameter):
        # Derivative of the regularization term created by @Shai
        sgn = torch.sign(parameter)
        return self.weight_decay * (
            (sgn - 2 * parameter) * torch.sign(1 - parameter * sgn)
        )

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)
Usage is really simple (with your specified weight_decay hyperparameter if you need more/less force on the params):
constrained_layer = Constrainer(torch.nn.Linear(20, 10), weight_decay=0.1)
Now you don't have to worry about different loss functions and can use your model normally.
You can use the loss function:
def custom_loss_function(x):
    loss = torch.abs(x**2 - torch.abs(x))
    return loss.mean()
This graph plots the proposed loss for a single element:
As you can see, the proposed loss is zero for x={-1, 0, 1} and positive otherwise.
Note that if you want to apply this loss to the weights of a specific layer, then your x here are the weights, not the activations of the layer.
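For instance, a minimal sketch of applying this loss to the weights of one layer (the model layout, reg_weight, and the dummy data below are made up for illustration):

import torch

# Toy model; only the first Linear layer's weights are constrained.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
)

def custom_loss_function(x):
    # zero at x in {-1, 0, 1}, positive elsewhere
    return torch.abs(x**2 - torch.abs(x)).mean()

inputs = torch.randn(8, 20)
targets = torch.randint(0, 2, (8,))
task_loss = torch.nn.functional.cross_entropy(model(inputs), targets)

reg_weight = 0.1  # illustrative hyperparameter
custom_loss = custom_loss_function(model[0].weight)  # weights, not activations
loss = task_loss + reg_weight * custom_loss
loss.backward()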
I'm experimenting with a CNN architecture involving three identical networks that I train on non-overlapping datasets and then coordinate at each iteration. Each weight is updated by averaging it with the corresponding weights in the other nets and then proportionally adding that weight's current gradient.
I'm using TensorFlow 2.2.0 and Keras, and I think I want to override apply_gradients to do it. My first question: should I be overriding apply_gradients?
Secondly, I have a list of the parameters for each of the models. In apply_gradients, I have a list of gradients and the var_list that goes with it. The gradients are Tensors, and the parameters are Variables. apply_gradients needs to return an Operation. How do I take a weighted sum (an average) of the parameter variables, then perform standard gradient descent, and return an Operation?
Here's my current (partially commented-out) apply_gradients code, taken out of my custom optimizer subclass:
def apply_gradients(self,
                    grads_and_vars,
                    name=None,
                    experimental_aggregate_gradients=True):
    # Formatting grads_and_vars
    grads_and_vars = _filter_grads(grads_and_vars)
    var_list = [v for (_, v) in grads_and_vars]
    with K.name_scope(self._name):
        with ops.init_scope():
            self._create_all_weights(var_list)
        if not grads_and_vars:
            return control_flow_ops.no_op()
        strategy = distribute_ctx.get_strategy()
        apply_state = self._prepare(var_list)
        # Formatting done

        # Here's the trouble spot, where I'm trying to update the vars (CDSGD)
        grads, var_list = zip(*grads_and_vars)
        grads = list(grads)
        var_list = list(var_list)
        opsR = []
        l_r = self._get_hyper("learning_rate")
        for i in range(len(grads)):
            # base = var_list[i] * 0
            # for j in range(3):  # 3 networks; agent_id is this network's id, 0-2
            #     base += self.pi[j][self.agent_id] * parameters[j][i]
            # # parameters holds all the networks' parameters
            # var_list[i] = var_list[i].assign(base)
            opt1 = training_ops.resource_apply_gradient_descent(
                var_list[i].handle, l_r, grads[i], use_locking=self._use_locking)
            opsR.append(opt1)
        return opsR
With this commented out, it trains, but there is no collaboration. I've tried a couple other things, and they all either don't run or don't train.
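For what it's worth, here is one rough, untested direction for the trouble spot using only public APIs (this is a sketch, not verified against the CDSGD formulation; parameters, self.pi, and self.agent_id are the structures described in the question):

# Replace the loop body above: average the corresponding parameters from all
# three networks first, then take a plain SGD step on the averaged weights,
# and return a single Operation via tf.group.
opsR = []
for i in range(len(grads)):
    averaged = tf.add_n([self.pi[j][self.agent_id] * parameters[j][i]
                         for j in range(3)])
    assign_op = var_list[i].assign(averaged)
    with tf.control_dependencies([assign_op]):
        # plain SGD on the averaged weights; l_r may need a dtype cast
        opsR.append(var_list[i].assign_sub(l_r * grads[i]))
return tf.group(*opsR)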
I want to implement C-MWP as described here: https://arxiv.org/pdf/1608.00507.pdf in keras/tensorflow.
This involves modifying the way backprop is performed. The new gradient is a function of the bottom activation responses, the weight parameters, and the gradients of the layer above.
As a start, I was looking at the way keras-vis is doing modified backprop:
def _register_guided_gradient(name):
    if name not in ops._gradient_registry._registry:
        @tf.RegisterGradient(name)
        def _guided_backprop(op, grad):
            dtype = op.outputs[0].dtype
            gate_g = tf.cast(grad > 0., dtype)
            gate_y = tf.cast(op.outputs[0] > 0, dtype)
            return gate_y * gate_g * grad
However, to implement C-MWP I need access to the weights of the layer on which the backprop is performed. Is it possible to access the weights within the @tf.RegisterGradient(name) function? Or am I on the wrong path?
The gradient computation in TF is fundamentally per-operation. If the operation whose gradient you want to change is performed on the weights, or at least the weights are not far from it in the operation graph, you can try finding the weights tensor by walking the graph inside your custom gradient. For example, say you have something like
x = tf.get_variable(...)
y = 5.0 * x
tf.gradients(y, x)
You can get to the variable tensor (more precisely, the tensor produced by the variable reading operation) with something like
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1]
    ...
If the weights are not immediate inputs, but you know how to get to them, you can walk the graph a bit using something like:
@tf.RegisterGradient(name)
def my_grad(op, grad):
    weights = op.inputs[1].op.inputs[0].op.inputs[2]
    ...
You should understand that this solution is very hacky. If you control the forward pass, you might want to define a custom gradient just for the subgraph you care about. You can see how to do that in How to register a custom gradient for an operation composed of tf operations, How Can I Define Only the Gradient for a Tensorflow Subgraph?, and https://www.tensorflow.org/api_docs/python/tf/Graph#gradient_override_map
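To make that last pointer concrete, a small sketch of gradient_override_map in TF 1.x graph mode (the "GuidedRelu" name and placeholder shape are just illustrative here):

import tensorflow as tf

@tf.RegisterGradient("GuidedRelu")  # illustrative name
def _guided_relu_grad(op, grad):
    gate_g = tf.cast(grad > 0., grad.dtype)
    gate_y = tf.cast(op.outputs[0] > 0., grad.dtype)
    return gate_y * gate_g * grad

g = tf.get_default_graph()
x = tf.placeholder(tf.float32, [None, 10])
with g.gradient_override_map({"Relu": "GuidedRelu"}):
    y = tf.nn.relu(x)  # this op's gradient now uses _guided_relu_grad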
Due to GPU memory limits, I want to update my weights only after every two training steps. Specifically, the network first computes the loss for the first batch of inputs and saves it, then computes the loss for the next batch, averages the two losses, and updates the weights once. This is like the average_loss op in Caffe (for example, fcn-berkeley). And how should the batch-norm update ops be computed in this setting?
Easy, just use tf.reduce_mean(input_tensor).
TF documentation: reduce_mean
and in your case, it will be :
loss = tf.concat([loss1,loss2], axis=0)
final_loss = tf.reduce_mean(loss, axis=0)
Please check this thread for correct info on Caffe's average_loss.
You should be able to compute an averaged loss by subclassing LoggingTensorHook in a way like
import logging

import numpy as np
import tensorflow as tf


class MyLoggingTensorHook(tf.train.LoggingTensorHook):

    # set every_n_iter to 2 if you want to average the last 2 losses
    def __init__(self, tensors, every_n_iter):
        super().__init__(tensors=tensors, every_n_iter=every_n_iter)
        # keep track of previous losses
        self.losses = []

    def after_run(self, run_context, run_values):
        _ = run_context
        # assuming you have a tag like 'average_loss'
        # as the name of your loss tensor
        for tag in self._tag_order:
            if 'average_loss' in tag:
                self.losses.append(run_values.results[tag])
        if self._should_trigger:
            self._log_tensors(run_values.results)
        self._iter_count += 1

    def _log_tensors(self, tensor_values):
        original = np.get_printoptions()
        np.set_printoptions(suppress=True)
        logging.info("%s = %s" % ('average_loss', np.mean(self.losses)))
        np.set_printoptions(**original)
        self.losses = []
and attach it to an estimator's train method or use a TrainSpec.
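For example, a hedged usage sketch (estimator, train_input_fn, and the loss tensor below are placeholders, not from the question):

# Attach the hook when training an Estimator.
hook = MyLoggingTensorHook(tensors={'average_loss': loss}, every_n_iter=2)
estimator.train(input_fn=train_input_fn, hooks=[hook])

# or, with train_and_evaluate:
# train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, hooks=[hook])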
You should be able to compute gradients of your variables normally in every step, but apply them only every N steps by conditioning on a global_step variable that defines your current iteration or step (you should have initialized this variable in your graph with something like global_step = tf.train.get_or_create_global_step()). Please see the usage of compute_gradients and apply_gradients for this, and the sketch below.
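A rough, untested sketch of that second idea in TF 1.x graph mode (N, the accumulator variables, and the counter are made up for illustration):

# Accumulate gradients every step, apply them only every N steps.
N = 2
opt = tf.train.GradientDescentOptimizer(0.01)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# one non-trainable accumulator per variable, plus a step counter
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
counter = tf.Variable(0, trainable=False, dtype=tf.int32)

accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]
new_count = counter.assign_add(1)  # fresh value of the counter

def apply_and_reset():
    # apply the averaged accumulated gradients, then zero the accumulators
    apply_op = opt.apply_gradients([(a / N, v) for a, v in zip(accum, tvars)])
    with tf.control_dependencies([apply_op]):
        reset_ops = [a.assign(tf.zeros_like(a)) for a in accum]
        with tf.control_dependencies(reset_ops):
            return tf.constant(True)

with tf.control_dependencies(accum_ops):
    train_op = tf.cond(tf.equal(new_count % N, 0),
                       apply_and_reset, lambda: tf.constant(False))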
I'm searching for a way to compute the weight-update-ratio for optimizer steps in Tensorflow. The weight-update-ratio is defined as the update-scale divided by the variable scale in each step and can be used for inspecting network training.
Ideally I want a non-intrusive way to compute it in a single session run, but I couldn't quite accomplish what I was looking for. Since the update scale and parameter scale are independent of the train step, one needs to add explicit dependencies to the graph in order to capture the variable scale before and after the update step. Unfortunately, it seems that in TF dependencies can only be defined for new nodes, which further complicates the issue.
So far, the best I've come up with is a context manager for defining the necessary operations. It's used as follows:
opt = tf.train.AdamOptimizer(1e0)
grads = tf.gradients(loss, tf.trainable_variables())
grads = list(zip(grads, tf.trainable_variables()))

with compute_weight_update_ratio('wur') as wur:
    train = opt.apply_gradients(grads_and_vars=grads)

# ...

with tf.Session() as sess:
    sess.run(wur.ratio)
The full code of compute_weight_update_ratio can be found below. What bugs me is that in the current state the weight-update-ratio (at least norm_before) is computed with every training step, but for performance reasons I'd prefer to do it selectively (e.g. only when summaries are computed).
Any ideas on how to improve?
@contextlib.contextmanager
def compute_weight_update_ratio(name, var_scope=None):
    '''Injects training to compute the weight-update-ratio.

    The weight-update-ratio is computed as the update scale divided
    by the variable scale before the update and should be somewhere in the
    range 1e-2 or 1e-3.

    Params
    ------
    name : str
        Operation name

    Kwargs
    ------
    var_scope : str, optional
        Name selection of variables to compute the weight-update-ratio for.
        Defaults to all. Regex supported.
    '''

    class WeightUpdateRatio:
        def __init__(self):
            self.num_train = len(tf.get_collection(tf.GraphKeys.TRAIN_OP))
            self.variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=var_scope)
            self.norm_before = tf.norm(self.variables, name='norm_before')

        def compute_ratio(self):
            train_ops = tf.get_collection(tf.GraphKeys.TRAIN_OP)
            assert len(train_ops) > self.num_train, 'Missing training op'
            with tf.control_dependencies(train_ops[self.num_train:]):
                self.norm_after = tf.norm(self.variables, name='norm_after')
            absdiff = tf.abs(tf.subtract(self.norm_after, self.norm_before), name='absdiff')
            self.ratio = tf.divide(absdiff, self.norm_before, name=name)

    with tf.name_scope(name) as scope:
        try:
            wur = WeightUpdateRatio()
            with tf.control_dependencies([wur.norm_before]):
                yield wur
        finally:
            wur.compute_ratio()
You don't need to worry about performance too much. TensorFlow only executes the subgraph necessary to produce the output.
So, in your training loop, if wur.ratio is not called during an iteration, none of the extra nodes created to compute it will be executed.
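For instance, a hedged usage sketch (num_steps and the summary cadence are made up; train and wur come from the snippet above):

# Fetch the ratio only on the iterations where you want it; otherwise the
# downstream norm_after / ratio nodes are simply not executed.
for step in range(num_steps):
    if step % 100 == 0:  # e.g. only when writing summaries
        _, ratio = sess.run([train, wur.ratio])
        print('weight-update-ratio:', ratio)
    else:
        sess.run(train)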