Is there a way to override the backward operation on nn.Module - python

I am looking for a nice way of overriding the backward operation in nn.Module for example:
class LayerWithCustomGrad(nn.Module):
    def __init__(self):
        super(LayerWithCustomGrad, self).__init__()
        self.weights = nn.Parameter(torch.randn(200))

    def forward(self, x):
        return x * self.weights

    def backward(self, grad_of_c):  # This gets called during loss.backward()
        # grad_of_c comes from the gradient of b*23
        grad_of_a = some_operation(grad_of_c)
        # perform extra computation
        # and more computation
        self.weights.grad = another_operation(grad_of_a, grad_of_c)
        return grad_of_a  # and the grad of parameter "a" will receive this

layer = LayerWithCustomGrad()
a = nn.Parameter(torch.randn(200), requires_grad=True)
b = layer(a)
c = b*23
Some of the projects I work on contain layers with non-differentiable functions. I would love it if there were some way to connect two broken graphs and/or modify the gradients of graphs that already exist.
It would also be great if there were a way of doing this in TensorFlow.

Because of the way PyTorch is built, you should first implement a custom torch.autograd.Function that contains the forward and backward pass for your layer. Then you can create an nn.Module that wraps this function together with the necessary parameters.
In this tutorial page you can see ReLU being implemented. Here I will show how to build a torch.autograd.Function and its nn.Module wrapper.
class F(torch.autograd.Function):
    """Both forward and backward are static methods."""

    @staticmethod
    def forward(ctx, input, weights):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input, weights)
        return input*weights

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the inputs: here input and weights
        """
        input, weights = ctx.saved_tensors
        grad_input = weights.clone()*grad_output
        grad_weights = input.clone()*grad_output
        return grad_input, grad_weights
The nn.Module will initialize the parameters and call F to handle the actual operation computation for forward/backward pass.
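class LayerWithCustomGrad(nn.Module):
    def __init__(self):
        # nn.Module must be initialized before parameters can be assigned
        super(LayerWithCustomGrad, self).__init__()
        self.weights = nn.Parameter(torch.rand(10))
        self.fn = F.apply

    def forward(self, x):
        return self.fn(x, self.weights)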
Now we can try to infer and backpropagate:
>>> layer = LayerWithCustomGrad()
>>> x = torch.randn(10, requires_grad=True)
>>> y = layer(x)
>>> y
tensor([ 0.2023, 0.7176, 0.3577, -1.3573, 1.5185, 0.0632, 0.1210, 0.1566,
0.0709, -0.4324], grad_fn=<FBackward>)
Notice the <FBackward> as grad_fn: this is the backward function of F bound to the previous inference we made with x.
>>> y.mean().backward()
>>> x.grad # i.e. grad_input in F.backward
tensor([0.0141, 0.0852, 0.0450, 0.0922, 0.0400, 0.0988, 0.0762, 0.0227, 0.0569,
>>> layer.weights.grad # i.e. grad_weights in F.backward
tensor([-1.4584, -2.1187, 1.5991, 0.9764, 1.8956, -1.0993, -3.7835, -0.4926,
0.9477, -1.2219])
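For the non-differentiable layers mentioned in the question, the same pattern can be used to substitute a hand-written gradient. The following is a minimal sketch and not part of the original answer: the class name RoundSTE is hypothetical, and passing the gradient through unchanged (often called a straight-through estimator) is an assumption made for illustration, not the only possible choice.

```python
import torch


class RoundSTE(torch.autograd.Function):
    """Hypothetical example: round in forward, identity gradient in backward."""

    @staticmethod
    def forward(ctx, input):
        # torch.round has zero gradient almost everywhere, so plain autograd
        # would not propagate any useful signal through it.
        return torch.round(input)

    @staticmethod
    def backward(ctx, grad_output):
        # Assumption: treat the forward pass as if it were the identity.
        return grad_output


x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = RoundSTE.apply(x)
loss = (y * torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()
print(y)       # the rounded values
print(x.grad)  # the gradient w.r.t. y, passed straight through to x
```

Whether an identity gradient is appropriate depends on the layer; any other surrogate can be returned from backward in the same way.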


Python OOP issue: How does a method in a class get its input value when it is not listed as a parameter of __init__()?

I am relatively new to object-oriented programming and am currently working on a Generative Adversarial Networks project. I came across the maxout activation function, which is defined through a maxout class. See the code below:
class maxout(torch.nn.Module):
    def __init__(self, num_pieces):
        super(maxout, self).__init__()
        self.num_pieces = num_pieces

    def forward(self, x):
        assert x.shape[1] % self.num_pieces == 0  # 625 % 5 = 0
        ret = x.view(*x.shape[:1],  # batch_size
                     x.shape[1] // self.num_pieces,
                     self.num_pieces,  # num_pieces
                     *x.shape[2:])
        ret, _ = ret.max(dim=2)
        return ret
This maxout function was later used in a discriminator class. The code for the discriminator class follows.
class discriminator(torch.nn.Module):
    def __init__(self):
        super(discriminator, self).__init__()
        self.fcn = torch.nn.Sequential(
            # Fully connected layer 1
            torch.nn.Linear(
                in_features = 784,
                out_features = 240,  # 240 = 48 * 5, so that maxout(5) yields 48 features
                bias = True
            ),
            maxout(5),
            # Fully connected layer 2
            torch.nn.Linear(
                in_features = 48,
                out_features = 1,  # a single score, as required by outputs.view(1) below
                bias = True
            )
        )

    def forward(self, batch):
        inputs = batch.view(batch.size(0), -1)
        outputs = self.fcn(inputs)
        outputs = outputs.mean(0)
        return outputs.view(1)  # it will return a single value
The code works fine, but as per my naive understanding of object-oriented programming, the value of 'x' in the forward() function of the maxout class should be provided through the __init__() function.
My question is: how does the forward() function of the maxout class receive its input 'x' without getting it through the __init__() function?
Another way to put this question: how is the output of the Linear layer in the discriminator class passed to the maxout function as 'x'?
You are passing layers to the constructor of the Sequential model, which is assigned to self.fcn; then, in discriminator.forward, you call this model. Its __call__ method then calls the forward functions of all the layers it contains, in order.
You can imagine something like this is going on:
def forward(self, batch):
    return torch.nn.Linear(
        in_features = 48,
        out_features = 1,
        bias = True
    ).forward(
        maxout(5).forward(
            torch.nn.Linear(
                in_features = 784,
                out_features = 240,
                bias = True
            ).forward(batch.view(batch.size(0), -1))
        )
    )
Methods differ from regular functions only in that their first argument is the instance of the class in which the method is defined, and this first argument can be supplied with the dot operator.
The programmer alone decides which values to pass through __init__ and store as object attributes, and which to pass as parameters following the first argument, self. There is no requirement to pass every value used by other methods through __init__ first.
To answer your question:
torch.nn.Sequential saves the list of modules you pass (in this example it has length 3) and gives you back a callable object (an object that can behave like a function). You save it to self.fcn. Then, when you call self.fcn, it calls the forward method of the first torch.nn.Linear, passes its return value to the forward method of the maxout(5) object (that's where the x is given!), then passes the output of the maxout(5) object to the forward method of the second torch.nn.Linear and returns the result.
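The chaining described above can be imitated in a few lines of plain Python. This is only a simplified sketch of the idea: TinyModule, TinySequential, Scale and AddOne are hypothetical stand-ins invented for illustration, not PyTorch's actual implementation.

```python
class TinyModule:
    """Hypothetical stand-in for nn.Module: calling the object runs forward()."""

    def __call__(self, *args):
        return self.forward(*args)


class Scale(TinyModule):
    def __init__(self, factor):
        # Configuration goes through __init__ ...
        self.factor = factor

    def forward(self, x):
        # ... while the data arrives later, as an argument to forward().
        return x * self.factor


class AddOne(TinyModule):
    def forward(self, x):
        return x + 1


class TinySequential(TinyModule):
    """Hypothetical stand-in for nn.Sequential."""

    def __init__(self, *layers):
        self.layers = layers

    def forward(self, x):
        # Feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return x


pipeline = TinySequential(Scale(2), AddOne(), Scale(10))
result = pipeline(3)  # ((3 * 2) + 1) * 10
print(result)
```

Here AddOne plays the role of maxout: it never sees its input in __init__, only when TinySequential hands it the output of the previous layer.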

How to create a custom PreprocessingLayer in TF 2.2

I would like to create a custom preprocessing layer using the tf.keras.layers.experimental.preprocessing.PreprocessingLayer layer.
In this custom layer, placed after the input layer, I would like to normalize my image using tf.cast(img, tf.float32) / 255.
I tried to find some code or an example showing how to create this preprocessing layer, but I couldn't find any.
Could someone please provide a full example of creating and using a PreprocessingLayer?
If you want a custom preprocessing layer, you don't actually need to use PreprocessingLayer. You can simply subclass Layer.
Take the simplest preprocessing layer, Rescaling, as an example. It lives under the tf.keras.layers.experimental.preprocessing namespace. However, if you check the actual implementation (see the linked source code), it is just a subclass of Layer decorated with @keras_export('keras.layers.experimental.preprocessing.Rescaling'):
class Rescaling(Layer):
    """Multiply inputs by `scale` and adds `offset`.

    For instance:
    1. To rescale an input in the `[0, 255]` range
    to be in the `[0, 1]` range, you would pass `scale=1./255`.
    2. To rescale an input in the `[0, 255]` range to be in the `[-1, 1]` range,
    you would pass `scale=1./127.5, offset=-1`.

    The rescaling is applied both during training and inference.

    Input shape:
        Arbitrary.
    Output shape:
        Same as input.
    Arguments:
        scale: Float, the scale to apply to the inputs.
        offset: Float, the offset to apply to the inputs.
        name: A string, the name of the layer.
    """

    def __init__(self, scale, offset=0., name=None, **kwargs):
        self.scale = scale
        self.offset = offset
        super(Rescaling, self).__init__(name=name, **kwargs)

    def call(self, inputs):
        dtype = self._compute_dtype
        scale = math_ops.cast(self.scale, dtype)
        offset = math_ops.cast(self.offset, dtype)
        return math_ops.cast(inputs, dtype) * scale + offset

    def compute_output_shape(self, input_shape):
        return input_shape

    def get_config(self):
        config = {
            'scale': self.scale,
            'offset': self.offset,
        }
        base_config = super(Rescaling, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
So Rescaling preprocessing is just another normal layer.
The main part is the call(self, inputs) method. You can put whatever logic you need there to preprocess your inputs and then return the result.
Easier-to-follow documentation on custom layers can be found here.
In a nutshell, you can do the preprocessing in a layer, either with a Lambda layer for simple operations or by subclassing Layer.
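Applied to the normalization from the question, subclassing Layer could look roughly like the sketch below. The class name ScaleTo01 is a hypothetical choice, and the example assumes a TensorFlow 2.x installation with eager execution.

```python
import numpy as np
import tensorflow as tf


class ScaleTo01(tf.keras.layers.Layer):
    """Hypothetical preprocessing layer: cast to float32 and divide by 255."""

    def call(self, inputs):
        return tf.cast(inputs, tf.float32) / 255.0


# Three uint8 "pixel" values, to show the effect of the layer on its own.
pixels = tf.constant([[0, 51, 255]], dtype=tf.uint8)
scaled = ScaleTo01()(pixels).numpy()
print(scaled)
```

In a model, the layer would be placed directly after the input layer, in the same position where a Lambda layer would otherwise go.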
I think the best and cleanest solution is to use a simple Lambda layer that wraps your preprocessing function.
This is a dummy working example:
import numpy as np
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
X = np.random.randint(0,256, (200,32,32,3))
y = np.random.randint(0,3, 200)
inp = Input((32,32,3))
x = Lambda(lambda x: x/255)(inp)
x = Conv2D(8, 3, activation='relu')(x)
x = Flatten()(x)
out = Dense(3, activation='softmax')(x)
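m = Model(inp, out)
# The model has to be compiled before it can be fitted.
m.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = m.fit(X, y, epochs=10)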

Can my PyTorch forward function do additional operations?

Typically a forward function strings together a bunch of layers and returns the output of the last one. Can I do some additional processing after that last layer before returning? For example, some scalar multiplication and reshaping via .view?
I know that the autograd somehow figures out gradients. So I don’t know if my additional processing will somehow screw that up. Thanks.
PyTorch tracks gradients through the computational graph of the tensors, not through the functions. As long as your input tensors or parameters have requires_grad=True and the results computed from them still carry a grad_fn (i.e., it is not None), you can do (almost) whatever you like and still be able to backprop.
As long as you are using PyTorch's operations (e.g., those listed here and here), you should be okay.
For more info see this.
For example (taken from torchvision's VGG implementation):
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        # ...

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # <-- what you were asking about
        x = self.classifier(x)
        return x
A more complex example can be seen in torchvision's implementation of ResNet:
class Bottleneck(nn.Module):
    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        # ...

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:  # <-- conditional execution!
            identity = self.downsample(x)
        out += identity  # <-- inplace operations
        out = self.relu(out)
        return out
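To check the specific case from the question (scalar multiplication followed by .view), a small experiment along these lines can be run. It is only a sketch: the module name Net and the layer sizes are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 6)

    def forward(self, x):
        out = self.fc(x)
        out = 0.5 * out            # scalar multiplication after the last layer
        return out.view(-1, 2, 3)  # reshaping; both ops are tracked by autograd


net = Net()
x = torch.randn(5, 4)
y = net(x)
y.sum().backward()

print(y.shape)           # the reshaped output
print(net.fc.bias.grad)  # each entry is 0.5 * 5 (scale factor times batch size)
```

The bias gradient reflects the extra 0.5 factor, which shows that the additional processing is part of the graph that backward() walks through.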

Short circuit computation in mixture of experts model using tensorflow keras functional api

I am trying to swap between multiple different "expert" layers based on the output of a "gating" layer (as a mixture of experts).
I created a custom layer that takes in the outputs of the expert and gating layers, but this ends up throwing away some outputs rather than not computing them in the first place.
How can I make the model "short circuit" to only evaluate the gating layer and the selected expert layer(s) to save computation time?
I am using TensorFlow 2.0 (GPU) and the Keras functional API.
Keras models can be implemented fully dynamically, to support the efficient routing that you mentioned. The following example shows one way in which this can be done. The example is written with the following premises:
It assumes there are two experts (LayerA and LayerB)
It assumes that a mix-of-experts model (MixOfExpertsModel) switches dynamically between the two expert layer classes depending on the per-example output of a Keras Dense layer
It satisfies the need to run training on the model in a batch fashion.
Pay attention to the comments in the code to see how the switching is done.
import numpy as np
import tensorflow as tf

# This is your Expert A class.
class LayerA(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight(name="weight_a", shape=input_shape[1:])

    def call(self, x):
        return x + self.weight

# This is your Expert B class.
class LayerB(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight(name="weight_b", shape=input_shape[1:])

    def call(self, x):
        return x * self.weight

class MixOfExpertsModel(tf.keras.models.Model):
    def __init__(self):
        super(MixOfExpertsModel, self).__init__()
        self._expert_a = LayerA()
        self._expert_b = LayerB()
        self._gating_layer = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, x):
        z = self._gating_layer(x)
        # The switching logic:
        # - examples with gating output <= 0.5 are routed to expert A
        # - examples with gating output > 0.5 are routed to expert B.
        mask_a = tf.squeeze(tf.less_equal(z, 0.5), axis=-1)
        mask_b = tf.squeeze(tf.greater(z, 0.5), axis=-1)
        # `input_a` is a subset of slices of the original input (`x`).
        # So is `input_b`. As such, no compute is wasted.
        input_a = tf.boolean_mask(x, mask_a, axis=0)
        input_b = tf.boolean_mask(x, mask_b, axis=0)
        if tf.size(input_a) > 0:
            output_a = self._expert_a(input_a)
        else:
            output_a = tf.zeros_like(input_a)
        if tf.size(input_b) > 0:
            output_b = self._expert_b(input_b)
        else:
            output_b = tf.zeros_like(input_b)
        # Return `mask_a`, and `mask_b`, so that the caller can know
        # which example is routed to which expert and whether its output
        # appears in `output_a` or `output_b`. This is necessary
        # for writing a (custom) loss function for this class.
        return output_a, output_b, mask_a, mask_b

# Create an instance of the mix-of-experts model.
mix_of_experts_model = MixOfExpertsModel()
# Generate some dummy data.
num_examples = 32
xs = np.random.random([num_examples, 8]).astype(np.float32)
# Call the model.
output_a, output_b, mask_a, mask_b = mix_of_experts_model(xs)
I didn't write a custom loss function that would support training this class, but that is doable using the return values of the model's call method, namely the outputs and masks.
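One possible shape for such a loss is sketched below. It is not from the original answer: the function name mix_of_experts_mse is hypothetical, mean squared error is an arbitrary choice, and the targets are assumed to have the same per-example shape as the expert outputs.

```python
import tensorflow as tf


def mix_of_experts_mse(y_true, output_a, output_b, mask_a, mask_b):
    """Hypothetical loss: MSE over all examples, whichever expert handled them."""
    # Select the targets that belong to each expert with the same masks
    # that were used to route the inputs.
    y_a = tf.boolean_mask(y_true, mask_a, axis=0)
    y_b = tf.boolean_mask(y_true, mask_b, axis=0)
    squared_errors = tf.concat(
        [tf.reshape(tf.square(output_a - y_a), [-1]),
         tf.reshape(tf.square(output_b - y_b), [-1])],
        axis=0,
    )
    return tf.reduce_mean(squared_errors)


# Toy values: examples 0 and 2 went to expert A, example 1 went to expert B.
y_true = tf.constant([[1.0], [2.0], [3.0]])
mask_a = tf.constant([True, False, True])
mask_b = tf.constant([False, True, False])
output_a = tf.constant([[1.0], [5.0]])
output_b = tf.constant([[2.0]])
loss = mix_of_experts_mse(y_true, output_a, output_b, mask_a, mask_b)
print(float(loss))
```

Because the masks come from a hard threshold, a loss of this form only passes gradients to the experts; the gating layer would need a separate training signal.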

Moving member tensors with module.to() in PyTorch

I am building a Variational Autoencoder (VAE) in PyTorch and have a problem writing device-agnostic code. The autoencoder is a child of nn.Module with an encoder and a decoder network, which are nn.Module children too. All weights of the network can be moved from one device to another by calling net.to(device).
The problem I have is with the reparametrization trick:
encoding = mu + noise * sigma
The noise is a tensor of the same size as mu and sigma and saved as a member variable of the autoencoder module. It is initialized in the constructor and resampled in-place each training step. I do it that way to avoid constructing a new noise tensor each step and pushing it to the desired device. Additionally, I want to fix the noise in the evaluation. Here is the code:
class VariationalGenerator(nn.Module):
    def __init__(self, input_nc, output_nc):
        super(VariationalGenerator, self).__init__()
        self.input_nc = input_nc
        self.output_nc = output_nc
        embedding_size = 128
        self._train_noise = torch.randn(batch_size, embedding_size)
        self._eval_noise = torch.randn(1, embedding_size)
        self.noise = self._train_noise
        # Create encoder
        self.encoder = Encoder(input_nc, embedding_size)
        # Create decoder
        self.decoder = Decoder(output_nc, embedding_size)

    def train(self, mode=True):
        super(VariationalGenerator, self).train(mode)
        self.noise = self._train_noise

    def eval(self):
        super(VariationalGenerator, self).eval()
        self.noise = self._eval_noise

    def forward(self, inputs):
        # Calculate parameters of embedding space
        mu, log_sigma = self.encoder.forward(inputs)
        # Resample noise if training
        if self.training:
            self.noise.normal_()
        # Reparametrize noise to embedding space
        inputs = mu + self.noise * torch.exp(0.5 * log_sigma)
        # Decode to image
        inputs = self.decoder(inputs)
        return inputs, mu, log_sigma
When I now move the autoencoder to the GPU with net.to('cuda:0'), I get an error in the forward pass because the noise tensor is not moved.
I don't want to add a device parameter to the constructor, because then it is still not possible to move the module to another device later. I also tried wrapping the noise in nn.Parameter so that it is affected by net.to(device), but that gives an error from the optimizer, as the noise is flagged as requires_grad=False.
Does anyone have a solution for moving all of the module's tensors with net.to(device)?
A better version of tilman151's second approach is probably to override _apply rather than to(). That way net.cuda(), net.float(), etc. will all work as well, since those all call _apply rather than to() (as can be seen in the source, which is simpler than you might think):
def _apply(self, fn):
    super(VariationalGenerator, self)._apply(fn)
    self._train_noise = fn(self._train_noise)
    self._eval_noise = fn(self._eval_noise)
    return self
After some more trial and error I found two methods:
Use buffers: By replacing self._train_noise = torch.randn(batch_size, embedding_size) with self.register_buffer('_train_noise', torch.randn(batch_size, embedding_size)), the noise tensor is added to the module as a buffer. This lets net.to(device) affect it, too. Additionally, the tensor is now part of the state_dict.
Override net.to(device): With this approach the noise stays out of the state_dict.
def to(self, device):
    new_self = super(VariationalGenerator, self).to(device)
    new_self._train_noise = new_self._train_noise.to(device)
    new_self._eval_noise = new_self._eval_noise.to(device)
    return new_self
By using this you may apply the same arguments to your tensors and the module
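def to(self, *args, **kwargs):
    # Forward positional arguments as well, so that net.to('cuda:0') works.
    module = super(VariationalGenerator, self).to(*args, **kwargs)
    module._train_noise = self._train_noise.to(*args, **kwargs)
    module._eval_noise = self._eval_noise.to(*args, **kwargs)
    return module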
Use this:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Now, for both the model and every tensor you use:
net.to(device)
input = input.to(device)
You can use nn.Module buffers and parameters; both are considered when calling .to(device) and are moved to the device.
Parameters are updated by the optimizer (so they need requires_grad=True); buffers are not.
So in your case, I'd write the constructor as:
def __init__(self, input_nc, output_nc):
    super(VariationalGenerator, self).__init__()
    self.input_nc = input_nc
    self.output_nc = output_nc
    embedding_size = 128
    self.register_buffer('_train_noise', torch.randn(batch_size, embedding_size))
    self.register_buffer('_eval_noise', torch.randn(1, embedding_size))
    self.noise = self._train_noise
    # Create encoder
    self.encoder = Encoder(input_nc, embedding_size)
    # Create decoder
    self.decoder = Decoder(output_nc, embedding_size)
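The buffer behaviour can be checked on a CPU-only machine with a reduced version of the module. This is a sketch under several assumptions: NoisyModule and its sizes are hypothetical, a single Linear layer stands in for the encoder and decoder, persistent=False (available in recent PyTorch releases) is used to keep the noise out of the state_dict, and .double() stands in for .to(device) because both go through the same mechanism.

```python
import torch
import torch.nn as nn


class NoisyModule(nn.Module):
    """Hypothetical reduced module that keeps its noise tensors as buffers."""

    def __init__(self, batch_size=4, embedding_size=8):
        super().__init__()
        # Non-persistent buffers follow .to()/.cuda()/.double() but are
        # not stored in the state_dict.
        self.register_buffer(
            "_train_noise", torch.randn(batch_size, embedding_size), persistent=False)
        self.register_buffer(
            "_eval_noise", torch.randn(1, embedding_size), persistent=False)
        self.linear = nn.Linear(embedding_size, embedding_size)

    @property
    def noise(self):
        # Look the buffer up on every access, so that no stale reference
        # to a tensor on the old device or dtype is kept around.
        return self._train_noise if self.training else self._eval_noise

    def forward(self, mu, log_sigma):
        if self.training:
            self._train_noise.normal_()  # resample in place
        return self.linear(mu + self.noise * torch.exp(0.5 * log_sigma))


net = NoisyModule().double()  # converts parameters and buffers alike
out = net(torch.zeros(4, 8, dtype=torch.double),
          torch.zeros(4, 8, dtype=torch.double))
print(net._train_noise.dtype, out.shape)
print(list(net.state_dict().keys()))
```

The property is a design choice worth noting: an attribute such as self.noise = self._train_noise keeps pointing at the original tensor after the buffers have been moved, whereas the property always returns the current buffer.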

