I am forward- and backpropagating tensor data X through two simple PyTorch nn.Module model instances, model1 and model2.
I can't get this process to work without using the deprecated Variable API.
So this works just fine:
y1 = model1(X)
v = Variable(y1.data, requires_grad=training) # It's all about this line!
y2 = model2(v)
criterion = nn.NLLLoss()
loss = criterion(y2, y)
loss.backward()
y1.backward(v.grad)
self.step()
But this will throw an error:
y1 = model1(X)
y2 = model2(y1)
criterion = nn.NLLLoss()
loss = criterion(y2, y)
loss.backward()
y1.backward(y1.grad) # it breaks here
self.step()
>>> RuntimeError: grad can be implicitly created only for scalar outputs
I just can't seem to find a relevant difference between v in the first implementation and y1 in the second. In both cases requires_grad is set to True. The only thing I could find was that y1.grad_fn=<ThnnConv2DBackward> and v.grad_fn=<ThnnConv2DBackward>.
What am I missing here? Which tensor attributes do I not know about, and if Variable is deprecated, what other implementation would work?
[UPDATED]
You are not correctly passing y1.grad into y1.backward in the second example. After the first backward pass, all the intermediate gradients are destroyed, so you need a special hook to extract them; in your case, you are passing None. Here is a small example that reproduces your case:
Code:
import torch
import torch.nn as nn
torch.manual_seed(42)
class Model1(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.pow(3)

class Model2(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x / 2
model1 = Model1()
model2 = Model2()
criterion = nn.MSELoss()
X = torch.randn(1, 5, requires_grad=True)
y = torch.randn(1, 5)
y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)
# We are going to backprop twice, so we need
# retain_graph=True on the first backward
loss.backward(retain_graph=True)
try:
    y1.backward(y1.grad)
except RuntimeError as err:
    print(err)
    print('y1.grad: ', y1.grad)
Output:
grad can be implicitly created only for scalar outputs
y1.grad: None
So you need to extract them correctly:
Code:
def extract(V):
    """Gradient extractor."""
    def hook(grad):
        V.grad = grad
    return hook
model1 = Model1()
model2 = Model2()
criterion = nn.MSELoss()
X = torch.randn(1, 5, requires_grad=True)
y = torch.randn(1, 5)
y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)
y1.register_hook(extract(y1))
loss.backward(retain_graph=True)
print('y1.grad: ', y1.grad)
y1.backward(y1.grad)
Output:
y1.grad: tensor([[-0.1763, -0.2114, -0.0266, -0.3293, 0.0534]])
After some investigation I came to the following two solutions.
The solution provided elsewhere in this thread retained the computation graph manually, with no option to free it, thus running fine initially but causing OOM errors later on.
The first solution is to tie the models together using the built-in torch.nn.Sequential, as such:
model = torch.nn.Sequential(Model1(), Model2())
It's as easy as that. It looks clean and behaves exactly like an ordinary model would.
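For instance, a minimal sketch reusing the toy Model1/Model2 and the MSE setup from the answer above:
import torch
import torch.nn as nn

model = torch.nn.Sequential(Model1(), Model2())  # Model1/Model2 as defined above

X = torch.randn(1, 5, requires_grad=True)
y = torch.randn(1, 5)

loss = nn.MSELoss()(model(X), y)
loss.backward()  # a single backward pass flows through both submodules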
The alternative is to simply tie them together manually:
model1 = Model1()
model2 = Model2()
y1 = model1(X)
y2 = model2(y1)
loss = criterion(y2, y)
loss.backward()
My fear that this would only backpropagate through model2 turned out to be unfounded, since model1 is also stored in the computation graph that is backpropagated over.
Compared to the previous implementation, this one offers increased transparency at the interface between the two models.
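One way to verify this, assuming model1 actually has trainable parameters (the toy Model1 above has none), is to check that its gradients are populated after the single backward pass:
# after loss.backward():
print(all(p.grad is not None for p in model1.parameters()))  # True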
Related
I tried to find documentation but cannot find anything about torch.softmax.
What is the difference among torch.nn.Softmax, torch.nn.functional.softmax, torch.softmax and torch.nn.functional.log_softmax?
Examples are appreciated.
import torch
x = torch.rand(5)
x1 = torch.nn.Softmax(dim=0)(x)  # passing dim explicitly avoids a deprecation warning
x2 = torch.nn.functional.softmax(x, dim=0)
x3 = torch.nn.functional.log_softmax(x, dim=0)
print(x1)
print(x2)
print(torch.log(x1))
print(x3)
tensor([0.2740, 0.1955, 0.1519, 0.1758, 0.2029])
tensor([0.2740, 0.1955, 0.1519, 0.1758, 0.2029])
tensor([-1.2946, -1.6323, -1.8847, -1.7386, -1.5952])
tensor([-1.2946, -1.6323, -1.8847, -1.7386, -1.5952])
torch.nn.Softmax and torch.nn.functional.softmax give identical outputs; one is a class (a PyTorch module), the other is a function.
log_softmax applies log after applying softmax.
NLLLoss takes log-probabilities (log(softmax(x))) as input, so you need log_softmax for NLLLoss; log_softmax is numerically more stable and usually yields better results.
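As a quick illustration (a minimal sketch with arbitrary values), NLLLoss on log_softmax output matches CrossEntropyLoss on the raw logits:
import torch
import torch.nn as nn

logits = torch.randn(3, 5)          # raw, unnormalized network outputs
targets = torch.tensor([0, 2, 4])   # class indices

nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))  # True: CrossEntropyLoss == log_softmax + NLLLoss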
import torch
import torch.nn as nn
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.LazyLinear(128)
        self.activation = nn.ReLU()
        self.layer_2 = nn.Linear(128, 10)
        self.output_function = nn.Softmax(dim=1)

    def forward(self, x, softmax="module"):
        y = self.layer_1(x)
        y = self.activation(y)
        y = self.layer_2(y)
        if softmax == "module":
            return self.output_function(y)
        # OR
        if softmax == "torch":
            return torch.softmax(y, dim=1)
        # OR (deprecated)
        if softmax == "functional":
            return nn.functional.softmax(y, dim=1)
        # OR (careful, the reason why the log is there is to ensure
        # numerical stability so you should use torch.exp wisely)
        if softmax == "log":
            return torch.exp(torch.log_softmax(y, dim=1))
        raise ValueError(f"Unknown softmax type {softmax}")
x = torch.rand(2, 2)
net = Network()
for s in ["module", "torch", "log"]:
    print(net(x, softmax=s))
Basically, nn.Softmax() creates a module (a callable object), whereas the others are plain functions.
Why would you need a log softmax? Well, an example lies in the docs of nn.Softmax:
This module doesn't work directly with NLLLoss,
which expects the Log to be computed between the Softmax and itself.
Use LogSoftmax instead (it's faster and has better numerical properties).
See also What is the difference between log_softmax and softmax?
I have a TensorFlow model f(x) and I sometimes need its gradients and sometimes not, depending on the result of the forward pass. In order to save computation time, I only want to compute the gradients when I need them. If I stop the gradient computation using stop_gradient() or don't record them on a GradientTape, it seems like I can never obtain the gradients without computing the forward pass again. A simplified example of what I'm trying to do looks like this (in pseudocode):
x = 5
y = f(x)
if y > 0:
    compute_gradients(f, x)
Is it possible to accomplish this in TensorFlow and if so, how would I do that?
Yes, you can skip gradient updates with a simple conditional.
import tensorflow as tf
from tensorflow.python.platform import test as test_lib
# network
x_in = tf.keras.Input([10])
x_out = tf.keras.layers.Dense(1)(x_in)
# optimizer
opt = tf.keras.optimizers.Adam(1e-1)
# forward pass
def train_step(model, X, y, threshold):
    with tf.GradientTape() as tape:
        y_hat = model(X)
        # threshold = tf.math.reduce_mean(y_hat)
        loss = tf.math.reduce_mean(tf.keras.losses.MSE(y, y_hat))
    if tf.math.greater(threshold, 1.0):
        m_vars = model.trainable_variables
        m_grads = tape.gradient(loss, m_vars)
        opt.apply_gradients(zip(m_grads, m_vars))
    return loss
# test cases
class SporaticGradientUpdateTest(test_lib.TestCase):
    def setUp(self):
        self.model = tf.keras.Model(x_in, x_out)
        self.X = tf.random.normal([100, 10])
        self.y = tf.random.normal([100])
        self.w_before = self.model.get_weights()

    def test_weights_dont_change(self):
        _ = train_step(self.model, self.X, self.y, 0.99)
        # get weights that shouldn't have updated
        w_after = self.model.get_weights()
        self.assertAllClose(self.w_before, w_after)

    def test_weights_change(self):
        _ = train_step(self.model, self.X, self.y, 1.01)
        # get weights that should have updated
        w_after = self.model.get_weights()
        self.assertNotAllClose(self.w_before, w_after)

if __name__ == "__main__":
    test_lib.main()
# [ RUN ] SporaticGradientUpdate.test_weights_change
# [ OK ] SporaticGradientUpdate.test_weights_change
# [ RUN ] SporaticGradientUpdate.test_weights_dont_change
# [ OK ] SporaticGradientUpdate.test_weights_dont_change
Per your comment, it looks like your use case is a little different from this example, but it should be adaptable to whatever you are trying to do.
In the example, I passed in the threshold as an arg so I could test both cases, but normally you would create it by doing something to the output of the network (like the commented-out portion).
I am trying to debug a relatively complex custom training method using custom loss functions, etc. In particular, I am trying to debug an issue in a custom training step, which is compiled into a TensorFlow tf.function and fitted as a compiled Keras model. I want to be able to print out an intermediate value of a tensor in a function call that is crashing. The difficulty is that since tensors inside a tf.function are graph values and aren't evaluated immediately, and since the function crashes during evaluation, it seems like the values aren't actually calculated. Here is a simple example:
class debug_model(tf.keras.Model):
    def __init__(self, width, depth, insize, outsize, batch_size):
        super(debug_model, self).__init__()
        self.width = width
        self.depth = depth
        self.insize = insize
        self.outsize = outsize
        self.net = tf.keras.models.Sequential()
        self.net.add(tf.keras.Input(shape=(insize,)))
        for i in range(depth):
            self.net.add(tf.keras.layers.Dense(width, activation='swish'))
        self.net.add(tf.keras.layers.Dense(outsize))

    def call(self, ipts):
        return self.net(ipts)

    @tf.function
    def train_step(self, data):
        ipt, target = data
        with tf.GradientTape(persistent=True) as tape_1:
            tape_1.watch(ipt)
            y = self(ipt)
            tf.print('y:', y)
            assert False
            loss = tf.keras.losses.MAE(target, y)
        trainable_vars = self.trainable_variables
        loss_grad = tape_1.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(loss_grad, trainable_vars))
        self.compiled_metrics.update_state(target, y)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}
If you compile this model with some data of your choice and run it:
train_set = tf.data.Dataset.from_tensor_slices(data_tuple).batch(opt.batchSize)
train_set = train_set.shuffle(buffer_size=trainpoints)  # shuffle returns a new dataset
model = debug_model(opt.width,opt.depth,in_size,out_size,batchSize)
optimizer = tf.keras.optimizers.Adam(learning_rate=opt.lr)
lr_sched = lambda epoch, lr: lr * 0.95**(1 / (8))
cb_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule = lr_sched, verbose = 1)
model.build((None,1))
model.summary()
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.MeanAbsoluteError(),
              )
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(path, verbose=2),
    cb_scheduler,
    tf.keras.callbacks.CSVLogger(path + 'log.csv')
]
hist = model.fit(train_set,epochs = opt.nEpochs,callbacks = callbacks)
If you load this up and run it, you will see that it exits due to the assertion error without printing. Is there a way I can force this tensor to evaluate so I can print it?
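One way to get such prints while debugging, assuming TF 2.x, is to disable graph compilation so the train step runs eagerly and tf.print fires before the assertion; a minimal sketch:
import tensorflow as tf

# Debugging only: run all tf.function-decorated code eagerly (slower).
tf.config.run_functions_eagerly(True)

# Alternatively, restrict eager execution to the Keras training loop:
# model.compile(optimizer=optimizer,
#               loss=tf.keras.losses.MeanAbsoluteError(),
#               run_eagerly=True)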
I have a model which has noisy linear layers (for which you can sample values from a mu and sigma parameter) and need to create two decorrelated outputs of it.
This means I have something like:
model.sample_noise()
output_1 = model(input)
with torch.no_grad():
    model.sample_noise()
    output_2 = model(input)
sample_noise actually modifies weights attached to the model according to a normal distribution.
But in the end this leads to
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation
The question actually is: what is the best way to avoid modifying these parameters? I could deepcopy the model every iteration and then use it for the second forward pass, but this does not sound very efficient to me.
If I understand your problem correctly, you want to have a linear layer with matrix M and then create two outputs
y_1 = (M + μ_1) * x + b
y_2 = (M + μ_2) * x + b
where μ_1, μ_2 ~ P. The simplest way would be, in my opinion, to create a custom class
import torch
import torch.nn.functional as F
from torch import nn
class NoisyLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super(NoisyLinear, self).__init__()
        # or any other initialization you want
        self.weight = nn.Parameter(torch.randn(n_out, n_in))
        self.bias = nn.Parameter(torch.randn(n_out))

    def sample_noise(self):
        # implement your noise generation here
        return torch.randn(*self.weight.shape) * 0.01

    def forward(self, x):
        noise = self.sample_noise()
        return F.linear(x, self.weight + noise, self.bias)
nl = NoisyLinear(4, 3)
x = torch.randn(2, 4)
y1 = nl(x)
y2 = nl(x)
print(y1, y2)
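Matching the original use case, the second pass can then simply run under torch.no_grad(); because the parameters are never modified in place, backpropagating through the first output still works:
y1 = nl(x)
with torch.no_grad():
    y2 = nl(x)  # decorrelated output (fresh noise), no gradient tracking
y1.sum().backward()  # fine: the weights were never mutated in place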
Right now I have a model configured to take its inputs with feed_dict. The code looks something like this:
# model.py
class MyModel(object):
    def __init__(self, hyperparams):
        self.build_model(hyperparams)

    def build_model(self, hps):
        self.input_data = tf.placeholder(dtype=tf.float32, shape=[hps.batch_size, hps.nfeats])
        self.labels = tf.placeholder(dtype=tf.float32, shape=[hps.batch_size])
        # Define hidden layers, loss, training step, etc.
# train.py
model = MyModel(hps)
for _ in range(100):
    x, y = some_python_function()  # Read a batch from disk, preprocess
    sess.run(model.train_step, feed_dict={model.input_data: x, model.labels: y})
For performance reasons, I'd like to switch to using queues for training. But I'd like to maintain the ability to use feed_dict, e.g. for inference or testing.
Is there an elegant way to do this? What I'd like to do is, when using queues, 'swap out' the placeholder variables for the tensors returned by my queue's dequeue op. I thought that tf.assign would be the way to do this, i.e.:
single_x, single_y = tf.parse_single_example(...)
x, y = tf.train.batch([single_x, single_y], batch_size)
model = MyModel(hps)
sess.run([tf.assign(model.input_data, x), tf.assign(model.labels, y)])
for _ in range(100):
    sess.run(model.train_step)
But this raises AttributeError: 'Tensor' object has no attribute 'assign'. The API docs for tf.assign describe the first argument as: "A mutable Tensor. Should be from a Variable node. May be uninitialized." Does this mean my placeholders aren't mutable? Can I make them so? Or am I approaching this the wrong way?
Minimal runnable example here.
You could separate the creation of the Variables and the Operations by:
- adding a build_variables method called at the instantiation of your Model class,
- changing the interface of the build_model method so that it accepts your x and y tensors as arguments and builds the model operations based on them.
This way you would reuse the variables and constants of your model. The downside is that the operations will be duplicated for the placeholder version and any other version.
import tensorflow as tf
import numpy as np
BATCH_SIZE = 2
class Model(object):
    def __init__(self):
        self.build_variables()

    def build_variables(self):
        self.w = tf.Variable(tf.random_normal([3, 1]))

    def build_model(self, x, y):
        self.x = x
        self.y = y
        self.output = tf.matmul(self.x, self.w)
        self.loss = tf.losses.absolute_difference(self.y, self.output)
model = Model()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
def placeholder_run():
    x = tf.placeholder(dtype=tf.float32, shape=[BATCH_SIZE, 3])
    y = tf.placeholder(dtype=tf.float32, shape=[BATCH_SIZE, 1])
    model.build_model(x, y)
    for i in range(3):
        x = np.random.rand(BATCH_SIZE, 3)
        y = x.sum(axis=1, keepdims=True)
        loss = sess.run(model.loss, feed_dict={model.x: x, model.y: y})
        print(loss)
def nonph_run():
    x = tf.random_normal([BATCH_SIZE, 3])
    y = tf.reduce_sum(x, axis=1, keep_dims=True)
    model.build_model(x, y)
    for i in range(3):
        loss = sess.run(model.loss)
        print(loss)
if __name__ == '__main__':
    # Works
    placeholder_run()
    # Doesn't fail
    nonph_run()
If you have control of your graph and know what you want upfront, you could use a switch on your input. For example,
x_plh = tf.placeholder(tf.float32, myshape)
x_dsk = my_input_from_disk()
use_dsk = tf.placeholder(tf.bool, ())
x = tf.cond(use_dsk, lambda: x_dsk, lambda: x_plh)
If you want a more flexible solution and are willing to take the somewhat pioneering route, you could have a go at the Dataset API of TensorFlow. Take the time to go through the doc; it is a nice read. A single Iterator can have several initializers using different Datasets, which could fit your case.
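A minimal sketch of that pattern, using the TF 1.x reinitializable-iterator API (the dataset definitions, parse_fn, x_test and y_test below are placeholders for your real pipeline):
import tensorflow as tf

# Two datasets with the same structure, one iterator shared between them.
train_ds = tf.data.TFRecordDataset('train.tfrecords').map(parse_fn).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

iterator = tf.data.Iterator.from_structure(train_ds.output_types,
                                           train_ds.output_shapes)
x, y = iterator.get_next()  # build the model on these tensors

train_init = iterator.make_initializer(train_ds)
test_init = iterator.make_initializer(test_ds)

with tf.Session() as sess:
    sess.run(train_init)   # switch the pipeline to training data
    # ... sess.run(train_op) ...
    sess.run(test_init)    # switch the pipeline to test data
    # ... sess.run(loss) ...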