Typically a forward function strings together a bunch of layers and returns the output of the last one. Can I do some additional processing after that last layer before returning? For example, some scalar multiplication and reshaping via .view?
I know that the autograd somehow figures out gradients. So I don’t know if my additional processing will somehow screw that up. Thanks.
PyTorch tracks gradients through the computational graph built from tensor operations, not through the Python functions themselves. As long as your tensors have requires_grad=True and every intermediate result still has a grad_fn (that is, it is still attached to the graph), you can do (almost) whatever you like and still be able to backprop.
As long as you are using PyTorch's operations (e.g., those listed here and here), you should be okay.
For more info see this.
For example (taken from torchvision's VGG implementation):
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        # ...

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # <-- what you were asking about
        x = self.classifier(x)
        return x
A more complex example can be seen in torchvision's implementation of ResNet:
class Bottleneck(nn.Module):
    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        # ...

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:  # <-- conditional execution!
            identity = self.downsample(x)
        out += identity  # <-- inplace operations
        out = self.relu(out)
        return out
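To tie this back to the scalar multiplication and .view from the question, here is a minimal sketch. The class name ScaledReshapedNet and the layer sizes are made up for illustration. It checks that gradients still reach the layer's parameters after the extra processing, and shows by contrast that a round trip through NumPy detaches the result from the graph.

```python
import torch
import torch.nn as nn


class ScaledReshapedNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 6)

    def forward(self, x):
        x = self.fc(x)
        x = 0.5 * x              # scalar multiplication after the last layer
        return x.view(-1, 2, 3)  # reshape; still part of the autograd graph


model = ScaledReshapedNet()
out = model(torch.randn(5, 4))
out.sum().backward()  # populates model.fc.weight.grad and model.fc.bias.grad

# Contrast: leaving PyTorch (here via NumPy) drops the graph connection.
detached = torch.from_numpy(out.detach().numpy())
print(out.grad_fn is not None, detached.requires_grad)
```

The output of the model keeps a grad_fn, while the tensor rebuilt from NumPy does not, so nothing upstream of it would receive gradients.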
I am looking for a nice way of overriding the backward operation in nn.Module for example:
class LayerWithCustomGrad(nn.Module):
    def __init__(self):
        super(LayerWithCustomGrad, self).__init__()
        self.weights = nn.Parameter(torch.randn(200))

    def forward(self, x):
        return x * self.weights

    def backward(self, grad_of_c):  # This gets called during loss.backward()
        # grad_of_c comes from the gradient of b*23
        grad_of_a = some_operation(grad_of_c)
        # perform extra computation
        # and more computation
        self.weights.grad = another_operation(grad_of_a, grad_of_c)
        return grad_of_a  # and the grad of parameter "a" will receive this

layer = LayerWithCustomGrad()
a = nn.Parameter(torch.randn(200), requires_grad=True)
b = layer(a)
c = b * 23
Some of the projects I work on contain layers with non-differentiable functions. I would love it if there were some way to connect two broken graphs and/or modify the gradients of graphs that already exist.
It would also be great if there is a way of doing this in TensorFlow.
The way PyTorch is built you should first implement a custom torch.autograd.Function which will contain the forward and backward pass for your layer. Then you can create a nn.Module to wrap this function with the necessary parameters.
In this tutorial page you can see ReLU being implemented. I will show here how to build a torch.autograd.Function and its nn.Module wrapper.
class F(torch.autograd.Function):
    """Both forward and backward are static methods."""

    @staticmethod
    def forward(ctx, input, weights):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input, weights)
        return input * weights

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the inputs: here input and weights
        """
        input, weights = ctx.saved_tensors
        grad_input = weights.clone() * grad_output
        grad_weights = input.clone() * grad_output
        return grad_input, grad_weights
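When you write backward by hand it is easy to get a term wrong, so it can help to compare it against numerical gradients. The following is a minimal sketch using torch.autograd.gradcheck. The class name Mul is made up here and simply mirrors the elementwise product above; gradcheck expects double-precision inputs with requires_grad=True.

```python
import torch
from torch.autograd import gradcheck


class Mul(torch.autograd.Function):  # hypothetical name, same math as the product above
    @staticmethod
    def forward(ctx, input, weights):
        ctx.save_for_backward(input, weights)
        return input * weights

    @staticmethod
    def backward(ctx, grad_output):
        input, weights = ctx.saved_tensors
        # d(input * weights)/d(input) = weights, and symmetrically for weights
        return weights * grad_output, input * grad_output


# gradcheck compares the analytical backward with finite differences.
a = torch.randn(10, dtype=torch.double, requires_grad=True)
b = torch.randn(10, dtype=torch.double, requires_grad=True)
ok = gradcheck(Mul.apply, (a, b), eps=1e-6, atol=1e-4)
print(ok)
```

If the handwritten backward were wrong, gradcheck would raise an error describing the mismatch instead of returning True.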
The nn.Module will initialize the parameters and call F to handle the actual operation computation for forward/backward pass.
class LayerWithCustomGrad(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.rand(10))
        self.fn = F.apply

    def forward(self, x):
        return self.fn(x, self.weights)
Now we can try to infer and backpropagate:
>>> layer = LayerWithCustomGrad()
>>> x = torch.randn(10, requires_grad=True)
>>> y = layer(x)
>>> y
tensor([ 0.2023, 0.7176, 0.3577, -1.3573, 1.5185, 0.0632, 0.1210, 0.1566,
0.0709, -0.4324], grad_fn=<FBackward>)
Notice the <FBackward> as grad_fn: this is the backward function of F bound to the previous inference we made with x.
>>> y.mean().backward()
>>> x.grad # i.e. grad_input in F.backward
tensor([0.0141, 0.0852, 0.0450, 0.0922, 0.0400, 0.0988, 0.0762, 0.0227, 0.0569,
0.0309])
>>> layer.weights.grad # i.e. grad_weights in F.backward
tensor([-1.4584, -2.1187, 1.5991, 0.9764, 1.8956, -1.0993, -3.7835, -0.4926,
0.9477, -1.2219])
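For the non-differentiable layers mentioned in the question, one common workaround (not the only one, and not something this answer prescribes) is a straight-through estimator: run the non-differentiable operation in forward and pass the incoming gradient through unchanged in backward. A minimal sketch, with the class name RoundSTE chosen only for illustration:

```python
import torch


class RoundSTE(torch.autograd.Function):  # hypothetical name
    @staticmethod
    def forward(ctx, x):
        # torch.round has zero gradient almost everywhere
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the rounding as the identity for gradients
        return grad_output


x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)
y = RoundSTE.apply(x)
coeffs = torch.tensor([1.0, 2.0, 3.0])
(y * coeffs).sum().backward()
print(y, x.grad)
```

Here x.grad equals coeffs, because the custom backward lets the gradient of the sum flow through the rounding step as if it were not there.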
I am trying to compute the Hessian of the output of a neural network with respect to its inputs. To give you an idea, this is the matrix I am trying to compute:
I am running Tensorflow 2.5.0 and my code to calculate the M matrix looks like this:
def get_Mass_Matrix(self, q, dq):
    nDof = dq.shape[1]
    with tf.GradientTape(persistent = True) as t2:
        t2.watch(dq)
        with tf.GradientTape(persistent = True) as t1:
            t1.watch(dq)
            T = self.kinetic(q, dq)
        g = t1.gradient(T, dq)
    h = t2.batch_jacobian(g, dq)
    return h
The function self.kinetic() calls a multilayer perceptron. When I compute M like this, I get the correct answer but my neural network training slows down significantly, even when running on a GPU.
I was wondering if there is a more efficient way to perform the same computation that doesn't result in so much overhead? Thank you.
For reference, I am using the subclassing approach to build the model (it inherits from tf.keras.Model).
Edit:
Adding more details about the self.kinetic function:
def kinetic(self, q, qdot):
    nDof = q.shape[1]
    qdq = tf.concat([tf.reshape(q, ((-1, nDof))),
                     tf.reshape(qdot, ((-1, nDof)))], axis = -1)
    return self.T_layers(qdq)
T_layers is defined as:
self.T_layers = L(nlayers = 4, n = 8, input_dim = (latent_dim, 1), nlact = 'swish', oact = 'linear')
Which is calling:
class L(tf.keras.layers.Layer):
    def __init__(self, nlayers, n, nlact, input_dim, oact = 'linear'):
        super(L, self).__init__()
        self.layers = nlayers
        self.dense_in = tf.keras.layers.Dense(n, activation = nlact, input_shape = input_dim)
        self.dense_lays = []
        for lay in range(nlayers):
            self.dense_lays.append(tf.keras.layers.Dense(n, activation = nlact, kernel_regularizer = 'l1'))
        self.dense_out = tf.keras.layers.Dense(1, activation = oact, use_bias = False)

    def call(self, inputs):
        x = self.dense_in(inputs)
        for lay in range(self.layers):
            x = self.dense_lays[lay](x)
        return self.dense_out(x)
I suspect part of the problem might be that I am not "building" the layers? Any advice is appreciated!
In order to get reasonable performance from TensorFlow, especially when computing gradients, you have to decorate your get_Mass_Matrix with @tf.function to make sure it runs in graph mode. To do this, everything inside the function has to be graph-mode compatible.
In the call function of class L, it is better to iterate the list directly instead of indexing it, i.e.:
class L(tf.keras.layers.Layer):
    ...
    def call(self, inputs):
        x = self.dense_in(inputs)
        for l in self.dense_lays:
            x = l(x)
        return self.dense_out(x)
Then, you can decorate your get_Mass_Matrix.
@tf.function
def get_Mass_Matrix(self, q, dq):
    with tf.GradientTape() as t2:
        t2.watch(dq)
        with tf.GradientTape() as t1:
            t1.watch(dq)
            T = self.kinetic(q, dq)
        g = t1.gradient(T, dq)
    return t2.batch_jacobian(g, dq)
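To convince yourself that the nested-tape pattern returns the right thing, you can try it on a function whose Hessian is known. The sketch below is standalone and does not use your model: it assumes a toy kinetic energy T = 0.5 * dq^T M dq with a fixed symmetric matrix M (both made up for illustration), so the Hessian with respect to dq should equal M for every example in the batch.

```python
import numpy as np
import tensorflow as tf

# Assumed toy mass matrix (symmetric), only for checking the pattern.
M = tf.constant([[2.0, 0.5], [0.5, 1.0]])


def kinetic(dq):
    # T = 0.5 * dq^T M dq, computed per example in the batch
    return 0.5 * tf.reduce_sum(dq * tf.matmul(dq, M), axis=-1)


@tf.function
def hessian_wrt_dq(dq):
    with tf.GradientTape() as t2:
        t2.watch(dq)
        with tf.GradientTape() as t1:
            t1.watch(dq)
            T = kinetic(dq)
        g = t1.gradient(T, dq)        # shape [batch, nDof]
    return t2.batch_jacobian(g, dq)   # shape [batch, nDof, nDof]


dq = tf.random.normal([4, 2])
H = hessian_wrt_dq(dq)
print(H.shape)
```

Each of the four 2x2 slices of H should match M up to floating-point error.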
Remark: the q and dq passed into get_Mass_Matrix must be tensors whose shape stays constant between calls; otherwise the function is retraced for every new shape, which slows things down instead of speeding them up.
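The retracing behaviour can be observed with a small standalone experiment. Python side effects inside a tf.function only run while the function is being traced, so appending to a list counts the traces. The function names and shapes below are made up for illustration. The second half shows one possible way to avoid retracing when the batch size varies: an input_signature with an unknown leading dimension.

```python
import tensorflow as tf

trace_log = []  # filled only during tracing, not on every call


@tf.function
def total(x):
    trace_log.append(tuple(x.shape.as_list()))
    return tf.reduce_sum(x)


total(tf.zeros([4, 3]))
total(tf.ones([4, 3]))   # same shape and dtype: the traced graph is reused
total(tf.zeros([5, 3]))  # new shape: triggers a second trace

fixed_log = []


@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def total_fixed(x):
    fixed_log.append(x.shape.as_list())
    return tf.reduce_sum(x)


total_fixed(tf.zeros([4, 3]))
total_fixed(tf.zeros([5, 3]))  # still the single trace made for [None, 3]

print(len(trace_log), len(fixed_log))
```

The first function is traced twice, once per distinct input shape, while the version with the input signature is traced only once.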
I would like to create a custom preprocessing layer using the tf.keras.layers.experimental.preprocessing.PreprocessingLayer layer.
In this custom layer, placed after the input layer, I would like to normalize my image using tf.cast(img, tf.float32) / 255.
I tried to find some code or an example showing how to create this preprocessing layer, but I couldn't find any.
Can someone please provide a full example of creating and using the PreprocessingLayer layer?
If you want a custom preprocessing layer, you don't actually need to use PreprocessingLayer. You can simply subclass Layer.
Take the simplest preprocessing layer, Rescaling, as an example. It lives under the tf.keras.layers.experimental.preprocessing.Rescaling namespace. However, if you check the actual implementation (source code link here), it is just a subclass of Layer, decorated with @keras_export('keras.layers.experimental.preprocessing.Rescaling'):
@keras_export('keras.layers.experimental.preprocessing.Rescaling')
class Rescaling(Layer):
    """Multiply inputs by `scale` and adds `offset`.

    For instance:
    1. To rescale an input in the `[0, 255]` range
    to be in the `[0, 1]` range, you would pass `scale=1./255`.
    2. To rescale an input in the `[0, 255]` range to be in the `[-1, 1]` range,
    you would pass `scale=1./127.5, offset=-1`.

    The rescaling is applied both during training and inference.

    Input shape:
        Arbitrary.
    Output shape:
        Same as input.
    Arguments:
        scale: Float, the scale to apply to the inputs.
        offset: Float, the offset to apply to the inputs.
        name: A string, the name of the layer.
    """

    def __init__(self, scale, offset=0., name=None, **kwargs):
        self.scale = scale
        self.offset = offset
        super(Rescaling, self).__init__(name=name, **kwargs)

    def call(self, inputs):
        dtype = self._compute_dtype
        scale = math_ops.cast(self.scale, dtype)
        offset = math_ops.cast(self.offset, dtype)
        return math_ops.cast(inputs, dtype) * scale + offset

    def compute_output_shape(self, input_shape):
        return input_shape

    def get_config(self):
        config = {
            'scale': self.scale,
            'offset': self.offset,
        }
        base_config = super(Rescaling, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
So Rescaling is just another ordinary layer.
The main part is the def call(self, inputs) function. You can put whatever logic you need there to preprocess your inputs and then return the result.
Easier documentation on custom layers can be found here.
In a nutshell, you can do the preprocessing in a layer, either with Lambda for simple operations or by subclassing Layer.
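Applied to the normalization from the question, a subclassed layer could look roughly like the sketch below. The class name ScaleTo01 is made up, and the layer is called directly on a small uint8 tensor rather than inside a full model, just to show what call does.

```python
import numpy as np
import tensorflow as tf


class ScaleTo01(tf.keras.layers.Layer):  # hypothetical name
    """Casts inputs to float32 and maps the [0, 255] range to [0, 1]."""

    def call(self, inputs):
        return tf.cast(inputs, tf.float32) / 255.0


layer = ScaleTo01()
pixels = tf.constant([[0, 51, 255]], dtype=tf.uint8)
scaled = layer(pixels)
print(scaled.numpy())
```

In a model, an instance of this layer would be placed right after the input layer, the same way Rescaling is used.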
I think the best and cleanest solution is a simple Lambda layer that wraps your pre-processing function.
This is a dummy working example:
import numpy as np
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
X = np.random.randint(0,256, (200,32,32,3))
y = np.random.randint(0,3, 200)
inp = Input((32,32,3))
x = Lambda(lambda x: x/255)(inp)
x = Conv2D(8, 3, activation='relu')(x)
x = Flatten()(x)
out = Dense(3, activation='softmax')(x)
m = Model(inp, out)
m.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
history = m.fit(X, y, epochs=10)
In order to understand how this code works, I have written a small reproducer. How does the self.hidden variable use the variable x in the forward method?
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs to hidden layer linear transformation
        self.hidden = nn.Linear(784, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)
        # Define sigmoid activation and softmax output
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.softmax(x)
        return x
You misunderstood what self.hidden = nn.Linear(784, 256) does. You wrote that:
hidden is defined as a function
but this is not true. self.hidden is an object of the class nn.Linear. When you call self.hidden(...), you are not passing arguments to the nn.Linear constructor; you are passing them to the object's __call__ method (which nn.Linear inherits from nn.Module), and that in turn calls the layer's forward method.
If you want more details on that, I have expanded on how it works in PyTorch: see this answer.
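The mechanism can be illustrated without PyTorch at all. The sketch below uses toy stand-in classes named Module and Linear; they are not PyTorch's real implementation (which also runs hooks and other bookkeeping), only a minimal model of how calling an object ends up in forward.

```python
class Module:
    """Toy stand-in for nn.Module: calling the object runs forward."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class Linear(Module):
    """Toy scalar 'linear layer': the constructor stores parameters only."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def forward(self, x):
        return self.weight * x + self.bias


hidden = Linear(2.0, 1.0)  # construction: no input data involved yet
result = hidden(3.0)       # __call__ -> forward: 2.0 * 3.0 + 1.0
print(result)
```

The arguments in Linear(2.0, 1.0) configure the object, while the argument in hidden(3.0) is the data passed through it, which mirrors nn.Linear(784, 256) versus self.hidden(x).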
I am trying to swap between multiple different "expert" layers based on the output of a "gating" layer (as a mixture of experts).
I created a custom layer that takes in the outputs of the expert and gating layers, but this ends up throwing away some outputs rather than not computing them in the first place.
How can I make the model "short circuit" to only evaluate the gating layer and the selected expert layer(s) to save computation time?
I am using tensorflow 2.0 gpu and the keras functional api
Keras models can be implemented fully dynamically, to support the efficient routing that you mentioned. The following example shows one way in which this can be done. The example is written with the following premises:
It assumes there are two experts (LayerA and LayerB)
It assumes that a mix-of-experts model (MixOfExpertsModel) switches dynamically between the two expert layer classes depending on the per-example output of a Keras Dense layer
It satisfies the need to run training on the model in a batch fashion.
Pay attention to the comments in the code to see how the switching is done.
import numpy as np
import tensorflow as tf
# This is your Expert A class.
class LayerA(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight(name="weight_a", shape=input_shape[1:])

    @tf.function
    def call(self, x):
        return x + self.weight

# This is your Expert B class.
class LayerB(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight(name="weight_b", shape=input_shape[1:])

    @tf.function
    def call(self, x):
        return x * self.weight
class MixOfExpertsModel(tf.keras.models.Model):
    def __init__(self):
        super(MixOfExpertsModel, self).__init__()
        self._expert_a = LayerA()
        self._expert_b = LayerB()
        self._gating_layer = tf.keras.layers.Dense(1, activation="sigmoid")

    @tf.function
    def call(self, x):
        z = self._gating_layer(x)
        # The switching logic:
        # - examples with gating output <= 0.5 are routed to expert A
        # - examples with gating output > 0.5 are routed to expert B.
        mask_a = tf.squeeze(tf.less_equal(z, 0.5), axis=-1)
        mask_b = tf.squeeze(tf.greater(z, 0.5), axis=-1)
        # `input_a` is a subset of slices of the original input (`x`).
        # So is `input_b`. As such, no compute is wasted.
        input_a = tf.boolean_mask(x, mask_a, axis=0)
        input_b = tf.boolean_mask(x, mask_b, axis=0)
        if tf.size(input_a) > 0:
            output_a = self._expert_a(input_a)
        else:
            output_a = tf.zeros_like(input_a)
        if tf.size(input_b) > 0:
            output_b = self._expert_b(input_b)
        else:
            output_b = tf.zeros_like(input_b)
        # Return `mask_a`, and `mask_b`, so that the caller can know
        # which example is routed to which expert and whether its output
        # appears in `output_a` or `output_b`. This is necessary
        # for writing a (custom) loss function for this class.
        return output_a, output_b, mask_a, mask_b
# Create an instance of the mix-of-experts model.
mix_of_experts_model = MixOfExpertsModel()
# Generate some dummy data.
num_examples = 32
xs = np.random.random([num_examples, 8]).astype(np.float32)
# Call the model.
print(mix_of_experts_model(xs))
I didn't write a custom loss function that would support the training of this class. But that's doable by using the return values of MixOfExpertsModel.call(), namely the outputs and masks.
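As a rough idea of what such a loss has to do, the outputs can be put back into the original example order using the masks and then compared with the targets. The sketch below shows only that bookkeeping, in plain NumPy rather than TensorFlow, with made-up gate values, stand-in expert computations, and a mean squared error picked purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((6, 3)).astype(np.float32)
# Assumed gating outputs, one per example.
gate = np.array([0.1, 0.9, 0.4, 0.7, 0.5, 0.8], dtype=np.float32)

# Same routing rule as above: <= 0.5 goes to expert A, > 0.5 to expert B.
mask_a = gate <= 0.5
mask_b = gate > 0.5

# Stand-ins for the two experts; each sees only its own slice of the batch.
output_a = x[mask_a] + 1.0
output_b = x[mask_b] * 2.0

# Scatter the per-expert outputs back into the original example order.
combined = np.empty_like(x)
combined[mask_a] = output_a
combined[mask_b] = output_b

targets = np.zeros_like(x)
loss = float(np.mean((combined - targets) ** 2))
print(combined.shape, loss)
```

A TensorFlow version would follow the same steps with tensor operations so that gradients can flow back to the experts.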