I am trying to understand if it is feasible in TF 2.0 to symbolically add to the weight variable of LSTM (self.kernel) another tf.Variable which I will control during training (it can be non-trainable itself).
E.g:
class AwesomeLSTM(tf.keras.layers.LSTM)
def build(...)
super().build(...)
self.new_weight = self.add_weight(shape=self.kernel.shape, ...)
self.kernel = self.kernel + self.new_weight
but when I change self.new_weight the value of self.kernel is not changed. Any ideas?
I had a similar issue in a project of mine (not on a LSTM though).
From what I could tell the problem is that when model.build() is called, the numpy value of self.new_weight is taken, not the symbolic variable.
A workaround that worked for me is adding the new_weight in the call function. In this case you need a different kernel so the addition doesn't aggregate:
class my_LSTM(tf.keras.layers.LSTM):
def build(self,input_shape):
super().build(input_shape)
self.new_weight = self.add_weight(shape=self.kernel.shape,trainable=False)
self.actual_kernel = self.add_weight(shape=self.kernel.shape,trainable=True)
def call(self,inputs,*args):
self.kernel = self.actual_kernel+self.new_weight
super().call(inputs,*args)
When testing this you need to keep in mind that the layer.kernel doesn't change until the layer is called again.
Hope this helps, when someone finds a way to do it symbolically as you probably intended I would also be very interested.
Related
I'm currently working on a solution via PyTorch. I'm not going to share the exact solution but I will provide code that reproduces the issue I'm having.
I have a model defined as follows:
class Net(nn.Module):
def __init__(self):
super(Net,self).__init__()
self.fc1 = nn.Linear(10,4)
def foward(self,x):
return nn.functional.relu(self.fc1(x))
Then I create a instance: my_model = Net(). Next I create an Adam optimizer as such:
optim = Adam(my_model.parameters())
# create a random input
inputs = torch.tensor(np.array([1,1,1,1,1,2,2,2,2,2]),dtype=torch.float32,requires_grad=True)
# get the outputs
outputs = my_model(inputs)
# compute gradients / backprop via
outputs.backward(gradient=torch.tensor([1.,1.,1.,5.]))
# store parameters before optimizer step
before_step = list(my_model.parameters())[0].detach().numpy()
# update parameters via
optim.step()
# collect parameters again
after_step = list(my_model.parameters())[0].detach().numpy()
# Print if parameters are the same or not
print(np.array_equal(before_step,after_step)) # Prints True
I provided my models parameters to the Adam optimizer, so I'm not exactly sure why the parameters aren't updating. I know in most cases one uses a loss function, however I cannot do that in my case but I assumed if I specified model paramters to the optimizers, it would know to connect the two.
Anyone know why the parameters aren't getting updated?
The problem is with detach (docs).
As noted at the bottom:
Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks
So that is exactly what's happening here. To correctly compare the parameters, you need to clone (docs) them to get a real copy.
list(my_model.parameters())[0].clone().detach().numpy()
On a side note, it can be helpful if you check the gradients after optim.step() with print(list(my_model.parameters())[0].grad) to check if the graph is intact. Also, don't forget to call optim.zero_grad().
Using Pytorch, I am trying to implement a network that is using the pre=trained DeepLab ResNet-101.
I found two possible methods for using this network:
this one
or
torchvision.models.segmentation.deeplabv3_resnet101(
pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
However, I might not only need this network's output, but also several inside layers' outputs.
Is there a way to access the inner layer outputs using one of these methods?
If not - Is it possible to manually copy the trained resnet's parameters so I can manually recreate it and add those outputs myself? (Hopefully the first option is possible so I won't need to do this)
Thanks!
You can achieve this without too much trouble using forward hooks.
The idea is to loop over the modules of your model, find the layers you're interested in, hook a callback function onto them. When called, those layers will trigger the hook. We will take advantage of this to save the intermediate outputs.
For example, let's say you want to get the outputs of layer classifier.0.convs.3.1:
layers = ['classifier.0.convs.3.1']
activations = {}
def forward_hook(name):
def hook(module, x, y):
activations[name] = y
return hook
for name, module in model.named_modules():
if name in layers:
module.register_forward_hook(forward_hook(name))
*The closure around hook() made by forward_hook's scope is used to enclose the module's name which you wouldn't otherwise have access to at this point.
Everything is ready, we can call the model
>>> model = torchvision.models.segmentation.deeplabv3_resnet101(
pretrained=True, progress=True, num_classes=21, aux_loss=None)
>>> model(torch.rand(16, 3, 100, 100))
And as expected, after inference, activations will have a new entry 'classifier.0.convs.3.1' which - in this case - will contain a tensor of shape (16, 256, 13, 13).
Not so long ago, I wrote an answer about a similar question which goes a little bit more in detail on how hooks can be used to inspect the intermediate output shapes.
When I'm training my model with TensorFlow 2.3, I want to visualize some intermediate tensors calculated using the weight in the computation graph of my customized tf.keras.layers.Layer.
So I use tf.summary.image() to record these tensors and visualize them as images like this:
class CustomizedLayer(tf.keras.layers.Layer):
def call(self, inputs, training=None):
# ... some code ...
tf.summary.image(name="some_weight_map", data=some_weight_map)
# ... some code ...
But in TensorBoard, no matter how many steps passed, there is only one image of step 0 shown.
And I tried to set the parameter step of tf.summary.image() to the value obtained from tf.summary.experimental.get_step():
tf.summary.image(name="weight_map", data=weight_map, step=tf.summary.experimental.get_step())
And update the step by calling tf.summary.experimental.set_step from a customized Callback using a tf.Variable like codes shown below:
class SummaryCallback(tf.keras.callbacks.Callback):
def __init__(self, step_per_epoch):
super().__init__()
self.global_step = tf.Variable(initial_value=0, trainable=False, name="global_step")
self.global_epoch = 0
self.step_per_epoch = step_per_epoch
tf.summary.experimental.set_step(self.global_step)
def on_batch_end(self, batch, logs=None):
self.global_step = batch + self.step_per_epoch * self.global_epoch
tf.summary.experimental.set_step(self.global_step)
# whether the line above is commented, calling tf.summary.experimental.get_step() in computation graph code always returns 0.
# tf.print(self.global_step)
def on_epoch_end(self, epoch, logs=None):
self.global_epoch += 1
This Callback's instance is passed in the argument callbacks in model.fit() function.
But the value tf.summary.experimental.get_step() returned is still 0.
The TensorFlow document of "tf.summary.experimental.set_step()" says:
when using this with #tf.functions, the step value will be captured at the time the function is traced, so changes to the step outside the function will not be reflected inside the function unless using a tf.Variable step.
Accroding to the document, I am already using a Variable to store the steps, but it's changes are still not reflected inside the function (or keras.Model).
Note: My code produces expected results in TensorFlow 1.x with just a simple line of tf.summary.image() before I migrate it to TensorFlow 2.
So I want to know if my approach is wrong in TensorFlow 2?
In TF2, how can I get training steps inside the computation graph?
Or there is other solution to summarize tensors (as scalar, image, etc.) inside a model in TensorFlow 2?
I found this issue has been reported on Github repository of Tensorflow: https://github.com/tensorflow/tensorflow/issues/43568
This is caused by using tf.summary in model while tf.keras.callbacks.TensorBoard callback is also enabled, and the step will always be zero. The issue reporter gives a temporary solution.
To fix it, inherit the tf.keras.callbacks.TensorBoard class and overwrite the on_train_begin method and on_test_begin method like this:
class TensorBoardFix(tf.keras.callbacks.TensorBoard):
"""
This fixes incorrect step values when using the TensorBoard callback with custom summary ops
"""
def on_train_begin(self, *args, **kwargs):
super(TensorBoardFix, self).on_train_begin(*args, **kwargs)
tf.summary.experimental.set_step(self._train_step)
def on_test_begin(self, *args, **kwargs):
super(TensorBoardFix, self).on_test_begin(*args, **kwargs)
tf.summary.experimental.set_step(self._val_step)
And use this fixed callback class in model.fit():
tensorboard_callback = TensorBoardFix(log_dir=log_dir, histogram_freq=1, write_graph=True, update_freq=1)
model.fit(dataset, epochs=200, callbacks=[tensorboard_callback])
This solve my problem and now I can get proper step inside my model by calling tf.summary.experimental.get_step().
(This issue may be fixed in later version of TensorFlow)
AKA Keras Model subclassing magic.
While playing with Keras, I noticed, that ResNetBlock.layers gets populated as I put new instances of layers into collections I previously put into my custom model.
class ResNetBlock(Model):
PART_COUNT = 3
def __init__(self, kernel_size, filters):
super().__init__()
self.convs = []
self.batchNorms = []
for part in range(ResNetBlock.PART_COUNT):
if part == 1:
conv = Conv2D(filters[part], kernel_size=kernel_size, padding="same")
else:
conv = Conv2D(filters[part], kernel_size=(1,1))
self.convs.append(conv)
self.batchNorms.append(BatchNormalization())
resnet = ResNetBlock(1, [1, 2, 3])
print(resnet.layers) # actually prints non-empty list
# filled with Conv2Ds and BNs from above
Adopted from official tutorial: https://www.tensorflow.org/beta/tutorials/eager/custom_layers
A bit of digging into TensorFlow source showed, that some kind of tracking is used via __setattr__ in Network class.
Now the code is not trivial, documentation lacking, and it seems unclear if the order of creating new layers/adding them to respective collections matters at all? E.g. if I first fill in convs collection, and only then batchNorms collection, would it still be the same model?
In most tutorials each layer is actually put into its own attribute.
Bonus question is: why is it done so implicitly? This kind of magic kinda breaks the motto to prefer explicit over implicit. What if for some reason I'd need to use a custom collection type not derived from list? How would I ensure these magic operations are done properly?
The order won't matter. What really changes your model is the call method. This stores the order of the operations (even if the order of the weights were variable, they would be applied in the same graph with the same functions)
Now, if you suspect that not using a "property", but using another kind of storage for the layers, would not register the layer for some reason, you can double check with:
print(len(resnet.trainable_weights))
The count should be 6 * PART_COUNT:
2 tensors for the conv layers (kernel and bias)
4 tensors for the BatchNormalization layers (mean, variance, scale and offset)
When training my network I am occasionally met with the warning:
W0722 11:47:35.101842 140641577297728 optimizer_v2.py:928] Gradients does not exist for variables ['model/conv1d_x/Variable:0'] when minimizing the loss.
This happens sporadically at infrequent intervals (maybe once in every 20 successful steps). My model basically has two paths which join together with concatenations at various positions in the network. To illustrate this, here is a simplified example of what I mean.
class myModel(tf.keras.Model):
def __init__(self):
self.conv1 = Conv2D(32)
self.conv2 = Conv2D(32)
self.conv3 = Conv2D(16)
def call(self, inputs):
net1 = self.conv1(inputs)
net2 = self.conv2(inputs)
net = tf.concat([net1, net2], axis=2)
net = self.conv3(net)
end_points = tf.nn.softmax(net)
model = myModel()
with tf.GradientTape() as tape:
predicition = model(image)
loss = myloss(labels, prediction)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
In reality my network is much larger, but the variables that generally don't have gradients tend to be the ones at the top of the network. Before each Conv2D layer I also have a custom gradient. Sometimes when I the error appears I can notice that the gradient function for that layer has not been called.
My question is how can the gradient tape sometimes take what appears to be different paths when propagating backwards through my network. My secondary question, is this caused by having two separate routes through my network (i.e. conv1 AND conv2). Is there a fundamental flaw in this network architecture?
Ideally, could I define to the GradientTape() that it must find the gradients for each of the top layers?
I had an issue that seems similar - may be helpful or not sure depending on what your network actually looks like, but basically, I had a multi-output network and I realised that as I was applying gradients that corresponded to the outputs separately, so for each separate loss there was a branch of the network for which the gradient was zero, but this was totally valid and corresponded to the terminal layers immediately prior to the non-targeted outputs each time. For this reason, I ended up replacing any None gradients with tf.zeros_like and it was possible to proceed with training. Could you have the same problem with multiple input heads to your network, if it's always at the top of the graph?
(ETA solution by Nguyễn Thu below is the code version of what I'm describing in above - exactly the same way that I dealt with it)
I've seen other answers where gradients weren't calculating because tensors aren't watched by default - you have to add them, but looks like that's not your issue as you should be only dealing with model.trainable_variables, or perhaps your myLoss function is getting a NaN result or casting to a numpy array occasionally depending on your batch composition, which would explain the sporadic nature (e.g. perhaps it's on batches that have no instances of a minority class if your data is very imbalanced?)
The solution given by Nguyễn and gkennos will suppress the error because it would replace all None by zeros.
However, it is a big issue that your gradient is null at any point in time.
The problem described above is certainly caused by unconnected variables (by default PyTorch will throw runtime error).
The most common case of unconnected layers can be exemplify as follow:
def some_func(x):
x1 = x * some variables
x2 = x1 + some variables #x2 discontinued after here
x3 = x1 / some variables
return x3
Now observe that x2 is unconnected, so gradient will not be propagated throw it. Carefully debug your code for unconnected variables.
If missing gradients are expected, this warning can be suppressed by this workaround:
optimizer.apply_gradients(
(grad, var)
for (grad, var) in zip(gradients, model.trainable_variables)
if grad is not None
)
Gradient tape's gradient method has a unconnected_gradients parameter that allows you to specify whether unconnected gradients should be None or Zero. See docs: https://www.tensorflow.org/api_docs/python/tf/GradientTape#gradient
So you could change the line:
gradients = tape.gradient(loss, model.trainable_variables)
to
gradients = tape.gradient(loss, model.trainable_variables,
unconnected_gradients=tf.UnconnectedGradients.ZERO)
This worked for me.
EDIT - IMPORTANT: This is only a solution if you actually expect some gradients to be zero. This is NOT a solution if the error results from a broken backpropagation. In that case you will need to find and fix where it is broken.
I had the same problem. Found the solution with customized gradients
def _compute_gradients(tensor, var_list):
grads = tf.gradients(tensor, var_list)
return [grad if grad is not None else tf.zeros_like(var)
for var, grad in zip(var_list, grads)]
from github trouble shoot
I also encoutered the same error. It was because I gave the wrong trainable variables in tape.gradient() function. If it can help someone.
In my example self.encoder_model.get_trainable_variables() was not returning the good variables:
#tf.function
def train_step(x_batch):
with tf.GradientTape() as tape:
loss = self.encoder_model.loss.compute_loss(x_batch)
gradients = tape.gradient(loss, self.encoder_model.get_trainable_variables())
self.optimizer.apply_gradients(zip(gradients, self.encoder_model.get_trainable_variables()))
Revisiting this question, it is actually quite unhelpful and probably should have been down voted more! There are many scenarios where your gradient has invalid values in it. But ultimately, at some point in the gradient computation a NaN value was created.
In my scenario I was using custom gradient op, and ultimately there was a bug in my gradient calculation code. This bug caused the NaN under some circumstances.
If you are not using custom gradient ops, then likely you've either made a mistake in your network definition (e.g., disconnected variable as other answers suggest) or there is some issue with your data.
In summary, no one problem will cause this, it just an artefact from a) buggy gradient calculation, b) buggy network definition, c) issue with your data or d) anything else. There is no one solution for this question, it's just the result of an error somewhere else.
To directly answer my questions in the original post:
Q. How can the gradient tape sometimes take what appears to be different paths when propagating backwards through my network?
A. It doesn't, a bug in the input to the gradient function resulted in no gradients being calcucated for that layer.
Q. My secondary question, is this caused by having two separate routes through my network (i.e. conv1 AND conv2). Is there a fundamental flaw in this network architecture?
A. No, there is nothing wrong with this architecture.
there are no gradients because the variable doesn't affect the answer.
in this code, the call function is missing a return
class myModel(tf.keras.Model):
def __init__(self):
self.conv1 = Conv2D(32)
self.conv2 = Conv2D(32)
self.conv3 = Conv2D(16)
def call(self, inputs):
net1 = self.conv1(inputs)
net2 = self.conv2(inputs)
net = tf.concat([net1, net2], axis=2)
net = self.conv3(net)
return end_points = tf.nn.softmax(net) # Change this line
TLDR make sure you are using CategoricalCrossentropy and not BinaryCrossentropy
An incorrect loss function for your application could cause this. For example if your outputs are one-hot encoded categorical labels e.g. [0,1] or [1,0] you need to use a Categorical cross entropy loss. If you use something like a Binary Cross Entropy loss by mistake then no gradients will be produced for gradients leading to the non-zeroth component of the NN output.