Applying non-torch function on loss before calling backward()? - python

I want to apply a custom non-torch function to the final calculated loss before computing the gradients (calling backward()). An example would be replacing torch.mean() on the loss vector with a custom, pure-Python (non-torch) mean function. But doing so breaks the computation graph. I cannot rewrite the custom mean function using torch operators, and I am at a loss as to how to do this. Any suggestions?

In pytorch you can easily do this by inheriting from torch.autograd.Function: All you need to do is implement your custom forward() and the corresponding backward() methods. Because I don't know the function you intend to write, I'll demonstrate it by implementing the sine function in a way that works with the automatic differentiation. Note that you need to have a method to compute the derivative of your function with respect to its input to implement the backward pass.
import torch

class MySin(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        """Compute the forward pass of the custom function."""
        ctx.save_for_backward(inp)  # save the input for the backward pass
        return inp.sin()  # forward pass; can also be computed by any other library

    @staticmethod
    def backward(ctx, grad_out):
        """Compute the product of the output gradient with the
        Jacobian of the function evaluated at the input."""
        inp, = ctx.saved_tensors
        grad_inp = grad_out * torch.cos(inp)  # propagate gradient; can also be computed by any other library
        return grad_inp
To use it you can use the function sin = MySin.apply on your input.
There is also another example worked out in the documentation.
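Applied to the mean from the question, the same pattern might look like the sketch below. It assumes the custom reduction can be called on a NumPy array and that its derivative is known analytically (for a mean over n elements, 1/n per element). The names numpy_mean and ExternalMean are hypothetical stand-ins, not part of PyTorch.

```python
import numpy as np
import torch


def numpy_mean(values):
    """Hypothetical stand-in for the custom non-torch reduction."""
    return float(np.mean(values))


class ExternalMean(torch.autograd.Function):
    @staticmethod
    def forward(ctx, losses):
        # Remember what backward needs: the input shape and element count.
        ctx.shape = losses.shape
        ctx.n = losses.numel()
        # Leave torch, run the external function, and wrap the result again.
        value = numpy_mean(losses.detach().cpu().numpy())
        return losses.new_tensor(value)

    @staticmethod
    def backward(ctx, grad_out):
        # d(mean)/d(loss_i) = 1/n for every element of the input.
        return grad_out * grad_out.new_full(ctx.shape, 1.0 / ctx.n)


pred = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
target = torch.zeros(3)
per_sample = (pred - target) ** 2      # ordinary torch operations
loss = ExternalMean.apply(per_sample)  # custom reduction outside torch
loss.backward()                        # gradients flow back to pred
```

The forward pass is free to use any library because autograd only needs the hand-written backward rule. If the derivative of the custom function is not known, this approach does not apply as written.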


Custom reduction of losses within each batch in Keras

I am using keras for tensorflow in Python. I have a custom loss function that returns a single number for each sample in a batch (so a vector with length = batch size). How can I also specify a custom reduction method to aggregate these sample losses into a single loss for the entire batch? Is it acceptable to include this reduction within the custom loss function and have this function return just a single scalar rather than a vector of losses?
It really depends on your application and goal. A very common approach is to apply reduce_mean to the per-sample losses over the batch. Some also use reduce_sum, which of course makes the loss value depend on the batch size. A general (and maybe unnecessarily complicated) approach is to keep the reduction in a separate function, which reduces the batch loss to a single value. Let's call it reducer. In the last line of your loss function, you can call it right before returning:
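from tensorflow import keras

class my_loss(keras.losses.Loss):
    def __init__(self, inputs):
        super().__init__()
        # a bunch of assignments
        # _get_reducer_function is a placeholder you define yourself;
        # a normal mean function works as well
        self.reducer = self._get_reducer_function(inputs)

    def call(self, y_true, y_pred):
        y_batch = ....  # per-sample losses, one value per sample
        return self.reducer(y_batch)

    def get_config(self):
        return {'input': 1}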
Of course you don't need to make it this complicated, but it should give you an idea of how to do it. You can also simply add sample weights (sample_weight) if you need them.

math range error for apply sigmoid activation function to implement neural network algorithm

I use the exp() function from Python's math package to implement the sigmoid function. Below is the source code:
import math

def transfer(self, activation):
    return 1.0/(1.0 + math.exp(-activation))
I call this function in a loop, passing in values from the dataset:
dataset = [[2.7810836,2.550537003,0],
I get an OverflowError: math range error
I think I know why this problem occurs: there are too many calculation steps.
But I have to get a result. What if I round down?
How can I solve this problem?
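One common way to avoid this error is sketched below (it is not part of the original question). math.exp raises OverflowError only when its argument is a large positive number, which happens in the code above when the activation is strongly negative. Branching on the sign means exp is only ever called with a non-positive argument. The name stable_sigmoid is hypothetical.

```python
import math


def stable_sigmoid(activation):
    """Sigmoid that never calls math.exp with a large positive argument."""
    if activation >= 0:
        # exp(-activation) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-activation))
    # For negative inputs use the equivalent form exp(a) / (1 + exp(a)).
    z = math.exp(activation)
    return z / (1.0 + z)


def naive_sigmoid(activation):
    """The formula from the question, kept for comparison."""
    return 1.0 / (1.0 + math.exp(-activation))


try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

safe_value = stable_sigmoid(-1000.0)
```

Both branches compute the same mathematical function, so rounding or clipping the result is not needed.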

How to apply Optimizer on Variable in Chainer?

Here is an example in Pytorch:
optimizer = optim.Adam([modifier_var], lr=0.0005)
And here in Tensorflow:
self.train = self.optimizer.minimize(self.loss, var_list=[self.modifier])
But Chainer's optimizers can only be used on a 'Link'. How can I apply an Optimizer to a Variable in Chainer?
In short, there is no way to directly assign a chainer.Variable (or even a chainer.Parameter) to a chainer.Optimizer.
The following is some redundant explanation.
First, I re-define Variable and Parameter to avoid confusion.
Variable is (1) torch.Tensor in PyTorch v0.4, (2) torch.autograd.Variable in PyTorch v0.3, and (3) chainer.Variable in Chainer v4.
Variable is an object that holds two tensors: .data and .grad. That is the necessary and sufficient condition, so a Variable is not necessarily a learnable parameter, which is the target of the optimizer.
In both libraries, there is another class, Parameter, which is similar to Variable but not the same. Parameter is torch.nn.Parameter in PyTorch and chainer.Parameter in Chainer.
Parameter must be a learnable parameter and should be optimized.
Therefore, there should be no reason to register a Variable (rather than a Parameter) with an Optimizer (although PyTorch allows registering a Variable with an Optimizer, this is just for backward compatibility).
Second, in PyTorch torch.optim.Optimizer directly optimizes Parameter, but in Chainer chainer.Optimizer DOES NOT optimize Parameter: instead, chainer.UpdateRule does. The Optimizer just registers UpdateRules to the Parameters in a Link.
Therefore, it is only natural that chainer.Optimizer does not receive Parameter as its argument, because it is just a "delivery-man" for UpdateRule.
If you want to attach a different UpdateRule to each Parameter, you should directly create an instance of an UpdateRule subclass and attach it to the Parameter.
Below is an example that learns a regression task with the MyChain MLP model using the Adam optimizer in Chainer.
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import Chain, Variable

# Prepare your model (neural network) as `Link` or `Chain`
class MyChain(Chain):
    def __init__(self):
        super(MyChain, self).__init__(
            l1=L.Linear(None, 30),
            l2=L.Linear(None, 30),
            l3=L.Linear(None, 1)
        )

    def __call__(self, x):
        h = self.l1(x)
        h = self.l2(F.sigmoid(h))
        return self.l3(F.sigmoid(h))

model = MyChain()

# Then you can instantiate optimizer
optimizer = chainer.optimizers.Adam()

# Register model to optimizer (to indicate which parameter to update)
optimizer.setup(model)

# Calculate loss, and update parameter as follows.
def lossfun(x, y):
    loss = F.mean_squared_error(model(x), y)
    return loss

# this iteration is "training", to fit the model into desired function.
# x and y are the training inputs and targets (not defined in this snippet).
for i in range(300):
    optimizer.update(lossfun, x, y)
So in summary, you need to set up the model; after that you can use the update function to calculate the loss and update the model's parameters.
The above code comes from here
There are also other ways to write the training code, for example using the Trainer module. For more detail, refer to the Chainer tutorials.

How can I apply custom regularization in CNTK (using python)?

Do you know how I can apply a custom regularization function in CNTK?
In particular, I would like to add to the loss the derivative of the function wrt the inputs; something like
newLoss = loss + lambda * gradient_F(inputs)
where F is the function learned by the model and inputs are the inputs to the model.
How can I achieve this in CNTK? I don't know how to access the gradients wrt the inputs, or how to take the gradient of the regularizer wrt the weights.
First, gradient is not a scalar, so it doesn't make a lot of sense to optimize it. The gradient norm might be an interesting thing to add to your loss. To do that, CNTK would have to take the gradient of the gradient norm, which at the time of this writing (July 2017) is not supported. It is however an important feature we want to add in the next few months.
Update: One workaround is to do something like this
noisy_inputs = x + C.random.normal_like(x, scale=0.01)
noisy_model = model.clone('share', {x: noisy_inputs})
auxiliary_loss = C.squared_error(model, noisy_model)
but you will have to tune the scale of the noise for your problem.
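Why this workaround behaves like a gradient penalty can be seen from a first-order Taylor expansion. The sketch below assumes a scalar-valued model $F$, noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ where $\sigma$ is the scale used above, and a noise level small enough that higher-order terms can be neglected:

```latex
\begin{aligned}
F(x + \varepsilon) &\approx F(x) + \nabla_x F(x)^{\top} \varepsilon \\
\mathbb{E}_{\varepsilon}\!\left[\bigl(F(x + \varepsilon) - F(x)\bigr)^{2}\right]
  &\approx \nabla_x F(x)^{\top}\, \mathbb{E}\!\left[\varepsilon \varepsilon^{\top}\right] \nabla_x F(x)
  = \sigma^{2}\, \lVert \nabla_x F(x) \rVert_2^{2}
\end{aligned}
```

Under these assumptions the auxiliary loss is, in expectation, roughly the squared gradient norm scaled by $\sigma^2$, which is why the noise scale has to be tuned together with the weight of the auxiliary term.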
CNTK learners only accept numbers as regularizer (L1/L2) values. If you really want to add your own custom regularizer, you can implement your own Learner, which gives you access to the gradients you need. You will find a couple of examples of how to implement your own Learner here.
Here's the code to do this:
import cntk as C

def cross_entropy_with_softmax_plus_regularization(model, labels, l2_regularization_weight):
    w_norm = C.Constant(0)
    # accumulate half the squared L2 norm of every parameter
    for p in model.parameters:
        w_norm = C.plus(w_norm, 0.5*C.reduce_sum(C.square(p)))
    return (C.reduce_log_sum_exp(model.output)
            - C.reduce_log_sum_exp(C.times_transpose(labels, model.output))
            + l2_regularization_weight*w_norm)
and my blog post about it

How to create an optimizer in Tensorflow

I want to write a new optimization algorithm for my network in TensorFlow. I hope to implement the Levenberg-Marquardt optimization algorithm, which is currently not part of the TF API. I found little documentation on how to write a custom optimizer, so I am asking whether someone can give me any advice. Thanks.
The simplest example of an optimizer is probably the gradient descent optimizer. It shows how one creates an instance of the basic optimizer class. The optimizer base class documentation explains what the methods do.
The python side of the optimizers adds new nodes to the graph that compute and apply the gradients being back-propagated. It supplies the parameters that get passed to the ops and does some of the high-level management of the optimizer. Then, you need the actual "Apply" op.
Ops have both a Python and a C++ component. Writing a training op follows the same (but specialized) process as adding any other Op to TensorFlow.
For an example set of training ops that compute and apply gradients, see
python/training/ - this is the Python glue for the actual training ops. Note that the code here is mostly about shape inference - the computation is going to be in the C++.
The actual math for applying the gradients is handled by an Op (recalling that, in general, ops are written in C++). In this case, the apply gradients ops are defined in core/kernels/ You can see, for example, the implementation of ApplyGradientDescentOp in there, which references a functor ApplyGradientDescent:
var.device(d) -= grad * lr();
The implementation of the Op itself follows the implementation of any other op as described in the adding-an-op docs.
Before running the TensorFlow Session, one should instantiate an Optimizer as seen below:
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.train.GradientDescentOptimizer(learning_rate) creates an object of the class GradientDescentOptimizer, which, as the name says, implements the gradient descent algorithm.
The method minimize() is called with a “cost” as its parameter and consists of two steps: compute_gradients() and then apply_gradients().
For most (custom) optimizer implementations, the method apply_gradients() needs to be adapted.
This method relies on the (new) Optimizer class, which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().
_create_slots() and _prepare() create and initialise additional variables, such as momentum.
_apply_dense() and _apply_sparse() implement the actual Ops, which update the variables.
Ops are generally written in C++. Without having to change the C++ header yourself, you can still return a Python wrapper of some Ops through these methods.
This is done as follows:
from tensorflow.python.ops import control_flow_ops, state_ops

def _create_slots(self, var_list):
    # Create slots for allocation and later management of additional
    # variables associated with the variables to train.
    # for example: the first and second moments.
    for v in var_list:
        self._zeros_slot(v, "m", self._name)
        self._zeros_slot(v, "v", self._name)

def _apply_dense(self, grad, var):
    # Define your favourite variable update.
    # For example, here we apply gradient descent by subtracting
    # the gradient times the learning_rate (defined in __init__)
    # from the variable.
    var_update = state_ops.assign_sub(var, self.learning_rate * grad)
    # The trick is now to pass the Ops to control_flow_ops.group, together
    # with any computation on the slots you wish to keep track of,
    # for example:
    m_t = ...m...  # do something with m and grad
    v_t = ...v...  # do something with v and grad
    return control_flow_ops.group(*[var_update, m_t, v_t])
For a more detailed explanation with example, see this blog post
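To make the roles of slots and the dense update concrete without depending on TensorFlow, here is a framework-free sketch in plain Python. The class ToyMomentumOptimizer, its method names, and the momentum rule are assumptions chosen for illustration. They mirror the shape of _create_slots() and _apply_dense() but are not TensorFlow APIs.

```python
class ToyMomentumOptimizer:
    """Plain-Python analogue of an optimizer with one slot per variable."""

    def __init__(self, learning_rate=0.1, momentum=0.5):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.slots = {}  # extra state kept alongside each variable

    def create_slots(self, var_names):
        # Analogue of _create_slots(): one zero-initialised accumulator each.
        for name in var_names:
            self.slots[name] = 0.0

    def apply_dense(self, name, var, grad):
        # Analogue of _apply_dense(): update the slot, then the variable.
        m = self.momentum * self.slots[name] + grad
        self.slots[name] = m
        return var - self.learning_rate * m


# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
opt = ToyMomentumOptimizer(learning_rate=0.1, momentum=0.5)
opt.create_slots(["w"])
w = 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w = opt.apply_dense("w", w, grad)
```

In the real optimizer the slot update and the variable update are graph Ops that get grouped together. Here they are ordinary assignments, which keeps the update rule itself easy to read.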

