When training my network I am occasionally met with the warning:
W0722 11:47:35.101842 140641577297728 optimizer_v2.py:928] Gradients does not exist for variables ['model/conv1d_x/Variable:0'] when minimizing the loss.
This happens sporadically at infrequent intervals (maybe once in every 20 successful steps). My model basically has two paths which join together with concatenations at various positions in the network. To illustrate this, here is a simplified example of what I mean.
class myModel(tf.keras.Model):
    def __init__(self):
        self.conv1 = Conv2D(32)
        self.conv2 = Conv2D(32)
        self.conv3 = Conv2D(16)

    def call(self, inputs):
        net1 = self.conv1(inputs)
        net2 = self.conv2(inputs)
        net = tf.concat([net1, net2], axis=2)
        net = self.conv3(net)
        end_points = tf.nn.softmax(net)
model = myModel()

with tf.GradientTape() as tape:
    prediction = model(image)
    loss = myloss(labels, prediction)

gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
In reality my network is much larger, but the variables that generally don't have gradients tend to be the ones at the top of the network. Before each Conv2D layer I also have a custom gradient. Sometimes when the error appears I notice that the gradient function for that layer has not been called.
My question is: how can the gradient tape sometimes take what appears to be different paths when propagating backwards through my network? My secondary question: is this caused by having two separate routes through my network (i.e. conv1 AND conv2)? Is there a fundamental flaw in this network architecture?
Ideally, could I tell the GradientTape() that it must find the gradients for each of the top layers?
I had an issue that seems similar - it may be helpful or not, depending on what your network actually looks like. Basically, I had a multi-output network, and I realised that as I was applying gradients that corresponded to the outputs separately, for each separate loss there was a branch of the network for which the gradient was missing. This was totally valid and corresponded to the terminal layers immediately prior to the non-targeted outputs each time. For this reason, I ended up replacing any None gradients with tf.zeros_like and it was possible to proceed with training. Could you have the same problem with multiple input heads to your network, if it's always at the top of the graph?
(ETA: the solution by Nguyễn Thu below is the code version of what I'm describing above - exactly the same way that I dealt with it.)
I've seen other answers where gradients weren't calculated because tensors aren't watched by default - you have to add them. But that doesn't look like your issue, since you're only dealing with model.trainable_variables. Alternatively, perhaps your myloss function is producing a NaN result or casting to a numpy array occasionally depending on your batch composition, which would explain the sporadic nature (e.g. perhaps it's on batches that have no instances of a minority class, if your data is very imbalanced?).
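If you suspect an occasional NaN loss, one quick diagnostic (just a sketch, reusing the names myloss, model, image and labels from the question) is to wrap the loss in tf.debugging.check_numerics, which raises an error as soon as the loss contains NaN or Inf:

with tf.GradientTape() as tape:
    prediction = model(image)
    loss = myloss(labels, prediction)
    loss = tf.debugging.check_numerics(loss, message="loss contains NaN/Inf")

gradients = tape.gradient(loss, model.trainable_variables)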
The solutions given by Nguyễn and gkennos will suppress the error, because they replace all None gradients with zeros.
However, it is a serious issue if your gradient is None at any point in time.
The problem described above is certainly caused by unconnected variables (by default PyTorch will throw a runtime error in this case).
The most common case of unconnected layers can be exemplified as follows:
def some_func(x):
    # w1, w2, w3 stand in for "some variables" (trainable variables)
    x1 = x * w1
    x2 = x1 + w2   # x2 is not used after this point
    x3 = x1 / w3
    return x3
Now observe that x2 is unconnected to the output, so no gradient will be propagated through it (and w2 will get no gradient). Carefully debug your code for unconnected variables.
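One way to find them is to check which entries returned by tape.gradient() are None. A small diagnostic sketch, reusing the model and loss names from the question at the top of this thread:

with tf.GradientTape() as tape:
    prediction = model(image)
    loss = myloss(labels, prediction)

gradients = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print("No gradient for:", var.name)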
If missing gradients are expected, this warning can be suppressed by this workaround:
optimizer.apply_gradients(
    (grad, var)
    for (grad, var) in zip(gradients, model.trainable_variables)
    if grad is not None
)
The GradientTape's gradient method has an unconnected_gradients parameter that allows you to specify whether unconnected gradients should be None or zero. See the docs: https://www.tensorflow.org/api_docs/python/tf/GradientTape#gradient
So you could change the line:
gradients = tape.gradient(loss, model.trainable_variables)
to
gradients = tape.gradient(loss, model.trainable_variables,
                          unconnected_gradients=tf.UnconnectedGradients.ZERO)
This worked for me.
EDIT - IMPORTANT: This is only a solution if you actually expect some gradients to be zero. This is NOT a solution if the error results from a broken backpropagation. In that case you will need to find and fix where it is broken.
I had the same problem. I found the solution with customized gradients:
def _compute_gradients(tensor, var_list):
    grads = tf.gradients(tensor, var_list)
    return [grad if grad is not None else tf.zeros_like(var)
            for var, grad in zip(var_list, grads)]
from a GitHub troubleshooting thread
I also encountered the same error. It was because I passed the wrong trainable variables to the tape.gradient() function. Hopefully it can help someone.
In my example, self.encoder_model.get_trainable_variables() was not returning the right variables:
@tf.function
def train_step(x_batch):
    with tf.GradientTape() as tape:
        loss = self.encoder_model.loss.compute_loss(x_batch)
    gradients = tape.gradient(loss, self.encoder_model.get_trainable_variables())
    self.optimizer.apply_gradients(zip(gradients, self.encoder_model.get_trainable_variables()))
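A quick sanity check (just a sketch) is to print what the getter actually returns before passing it to tape.gradient():

variables = self.encoder_model.get_trainable_variables()
print(len(variables))        # should match the number of trainable variables you expect
for v in variables:
    print(v.name, v.shape)   # any layer missing here will receive no gradient update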
Revisiting this question, it is actually quite unhelpful and probably should have been downvoted more! There are many scenarios where your gradient can have invalid values in it, but ultimately, at some point in the gradient computation a NaN value was created.
In my scenario I was using custom gradient op, and ultimately there was a bug in my gradient calculation code. This bug caused the NaN under some circumstances.
If you are not using custom gradient ops, then likely you've either made a mistake in your network definition (e.g., disconnected variable as other answers suggest) or there is some issue with your data.
In summary, no single problem causes this; it is just an artefact of a) a buggy gradient calculation, b) a buggy network definition, c) an issue with your data, or d) anything else. There is no one solution for this question; it's just the result of an error somewhere else.
To directly answer my questions in the original post:
Q. How can the gradient tape sometimes take what appears to be different paths when propagating backwards through my network?
A. It doesn't; a bug in the input to the gradient function resulted in no gradients being calculated for that layer.
Q. My secondary question: is this caused by having two separate routes through my network (i.e. conv1 AND conv2)? Is there a fundamental flaw in this network architecture?
A. No, there is nothing wrong with this architecture.
There are no gradients because the variable doesn't affect the output.
In this code, the call function is missing a return:
class myModel(tf.keras.Model):
    def __init__(self):
        self.conv1 = Conv2D(32)
        self.conv2 = Conv2D(32)
        self.conv3 = Conv2D(16)

    def call(self, inputs):
        net1 = self.conv1(inputs)
        net2 = self.conv2(inputs)
        net = tf.concat([net1, net2], axis=2)
        net = self.conv3(net)
        return tf.nn.softmax(net)  # Change this line: add the return
TL;DR: make sure you are using CategoricalCrossentropy and not BinaryCrossentropy.
An incorrect loss function for your application can cause this. For example, if your outputs are one-hot encoded categorical labels, e.g. [0,1] or [1,0], you need to use a categorical cross-entropy loss. If you use something like a binary cross-entropy loss by mistake, then no gradients will be produced for the weights leading to the non-zeroth components of the NN output.
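As a small illustration (made-up values, not code from the question):

import tensorflow as tf

labels = tf.constant([[0., 1.], [1., 0.]])       # one-hot encoded targets
probs  = tf.constant([[0.2, 0.8], [0.6, 0.4]])   # softmax outputs of the network

loss_fn = tf.keras.losses.CategoricalCrossentropy()  # correct choice for one-hot labels
loss = loss_fn(labels, probs)                        # BinaryCrossentropy here would be a mistake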
I'm currently working on a solution via PyTorch. I'm not going to share the exact solution but I will provide code that reproduces the issue I'm having.
I have a model defined as follows:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 4)

    def forward(self, x):
        return nn.functional.relu(self.fc1(x))
Then I create an instance: my_model = Net(). Next I create an Adam optimizer as such:
optim = Adam(my_model.parameters())
# create a random input
inputs = torch.tensor(np.array([1,1,1,1,1,2,2,2,2,2]),dtype=torch.float32,requires_grad=True)
# get the outputs
outputs = my_model(inputs)
# compute gradients / backprop via
outputs.backward(gradient=torch.tensor([1.,1.,1.,5.]))
# store parameters before optimizer step
before_step = list(my_model.parameters())[0].detach().numpy()
# update parameters via
optim.step()
# collect parameters again
after_step = list(my_model.parameters())[0].detach().numpy()
# Print if parameters are the same or not
print(np.array_equal(before_step,after_step)) # Prints True
I provided my model's parameters to the Adam optimizer, so I'm not exactly sure why the parameters aren't updating. I know that in most cases one uses a loss function; however, I cannot do that in my case, but I assumed that if I specified the model's parameters to the optimizer, it would know how to connect the two.
Does anyone know why the parameters aren't getting updated?
The problem is with detach (docs).
As noted at the bottom:
Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks
So that is exactly what's happening here. To correctly compare the parameters, you need to clone (docs) them to get a real copy.
list(my_model.parameters())[0].clone().detach().numpy()
On a side note, it can be helpful to check the gradients after optim.step() with print(list(my_model.parameters())[0].grad) to verify that the graph is intact. Also, don't forget to call optim.zero_grad().
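Putting that together, a sketch of the corrected comparison, reusing the toy setup from the question:

optim.zero_grad()                                  # clear any stale gradients
outputs = my_model(inputs)
outputs.backward(gradient=torch.tensor([1., 1., 1., 5.]))

before_step = list(my_model.parameters())[0].clone().detach().numpy()  # real copy
optim.step()
after_step = list(my_model.parameters())[0].clone().detach().numpy()

print(list(my_model.parameters())[0].grad)         # not None if the graph is intact
print(np.array_equal(before_step, after_step))     # now prints False once the update is applied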
I am trying to understand whether it is feasible in TF 2.0 to symbolically add to the weight variable of an LSTM (self.kernel) another tf.Variable which I will control during training (it can itself be non-trainable).
E.g.:
class AwesomeLSTM(tf.keras.layers.LSTM):
    def build(...):
        super().build(...)
        self.new_weight = self.add_weight(shape=self.kernel.shape, ...)
        self.kernel = self.kernel + self.new_weight
but when I change self.new_weight the value of self.kernel is not changed. Any ideas?
I had a similar issue in a project of mine (not on a LSTM though).
From what I could tell the problem is that when model.build() is called, the numpy value of self.new_weight is taken, not the symbolic variable.
A workaround that worked for me is adding the new_weight in the call function. In this case you need a separate kernel variable so the addition doesn't accumulate:
class my_LSTM(tf.keras.layers.LSTM):
    def build(self, input_shape):
        super().build(input_shape)
        self.new_weight = self.add_weight(shape=self.kernel.shape, trainable=False)
        self.actual_kernel = self.add_weight(shape=self.kernel.shape, trainable=True)

    def call(self, inputs, *args):
        self.kernel = self.actual_kernel + self.new_weight
        return super().call(inputs, *args)
When testing this, keep in mind that layer.kernel doesn't change until the layer is called again.
Hope this helps. When someone finds a way to do it symbolically, as you probably intended, I would also be very interested.
Do you know how I can apply a custom regularization function in CNTK?
In particular, I would like to add to the loss the derivative of the function w.r.t. the inputs; something like
newLoss = loss + lambda * gradient_F(inputs)
where F is the function learned by the model and inputs are the inputs to the model.
How can I achieve this in CNTK? I don't know how to access the gradients w.r.t. the inputs, or how to take the gradient w.r.t. the weights of the regularizer.
First, the gradient is not a scalar, so it doesn't make a lot of sense to optimize it. The gradient norm might be an interesting thing to add to your loss. To do that, CNTK would have to take the gradient of the gradient norm, which at the time of this writing (July 2017) is not supported. It is however an important feature we want to add in the next few months.
Update: One workaround is to do something like this
noisy_inputs = x + C.random.normal_like(x, scale=0.01)
noisy_model = model.clone('share', {x: noisy_inputs})
auxiliary_loss = C.squared_error(model, noisy_model)
but you will have to tune the scale of the noise for your problem.
CNTK learners only accept numbers as regularizer (L1/L2) values. If you really want to add your custom regularizer, you can easily implement your own Learner; you will have access to the gradients you need. You will find a couple of examples on how to implement your own Learner here.
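For illustration only, a rough, untested sketch of what such a Learner might look like, assuming the cntk.learners.UserLearner base class; the class name and the per-parameter update below are mine, not taken from CNTK's examples:

import cntk as C

class MySgdWithReg(C.learners.UserLearner):
    def __init__(self, parameters, lr_schedule):
        # as_numpy=True hands the gradients to update() as NumPy arrays
        super(MySgdWithReg, self).__init__(parameters, lr_schedule, as_numpy=True)

    def update(self, gradient_values, training_sample_count, sweep_end):
        eta = self.learning_rate() / training_sample_count
        for p, g in gradient_values.items():
            # g is the gradient of the loss w.r.t. parameter p; a custom
            # regularization term could be added to g here before the update
            p.value = p.value - eta * g
        return True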
Here's the code to do this (adding an L2 penalty on the parameters directly to the loss):
def cross_entropy_with_softmax_plus_regularization(model, labels, l2_regularization_weight):
    w_norm = C.Constant(0)
    for p in model.parameters:
        w_norm = C.plus(w_norm, 0.5 * C.reduce_sum(C.square(p)))
    return (C.reduce_log_sum_exp(model.output)
            - C.reduce_log_sum_exp(C.times_transpose(labels, model.output))
            + l2_regularization_weight * w_norm)
and my blog post about it: http://www.telesens.co/2017/09/29/spiral_cntk/
I'm using TensorFlow to do gradient descent classification.
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
Here, cost is the cost function that I have used in the optimization.
After launching the Graph in the Session, the Graph can be fed as:
sess.run(train_op, feed_dict)
And with this, all the variables in the cost function will be updated in order to minimize the cost.
Here is my question: how can I update only some of the variables in the cost function during training? Is there a way to convert created variables into constants or something?
There are several good answers, and this subject should already have been closed as a duplicate:
stackoverflow
Quora
Just to avoid another click for people getting here:
The minimize function of the tensorflow optimizer takes a var_list argument for that purpose:
first_train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
"scope/prefix/for/first/vars")
first_train_op = optimizer.minimize(cost, var_list=first_train_vars)
second_train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
"scope/prefix/for/second/vars")
second_train_op = optimizer.minimize(cost, var_list=second_train_vars)
I took it as-is from mrry's answer.
To get the list of the names you should use instead of "scope/prefix/for/second/vars", you can use:
tf.get_default_graph().get_collection_ref(tf.GraphKeys.TRAINABLE_VARIABLES)
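For completeness, a hypothetical sketch of how the two ops might be alternated in a session (num_steps and feed_dict are placeholders for your own training loop, not from the original answer):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        # updates only the variables under "scope/prefix/for/first/vars"
        sess.run(first_train_op, feed_dict=feed_dict)
        # updates only the variables under "scope/prefix/for/second/vars"
        sess.run(second_train_op, feed_dict=feed_dict)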
I want to write a new optimization algorithm for my network in TensorFlow. I hope to implement the Levenberg-Marquardt optimization algorithm, which is currently excluded from the TF API. I found poor documentation on how to write a custom optimizer, so I ask if someone can give me any advice. Thanks.
The simplest example of an optimizer is probably the gradient descent optimizer. It shows how one creates an instance of the basic optimizer class. The optimizer base class documentation explains what the methods do.
The python side of the optimizers adds new nodes to the graph that compute and apply the gradients being back-propagated. It supplies the parameters that get passed to the ops and does some of the high-level management of the optimizer. Then, you need the actual "Apply" op.
Ops have both a python and a C++ component. Writing a training op is the same (but specialized) as the general process of adding an Op to TensorFlow.
For an example set of training ops that compute and apply gradients, see
python/training/training_ops.py - this is the Python glue for the actual training ops. Note that the code here is mostly about shape inference - the computation is going to be in the C++.
The actual math for applying the gradients is handled by an Op (recalling that, in general, ops are written in C++). In this case, the apply gradients ops are defined in core/kernels/training_ops.cc. You can see, for example, the implementation of ApplyGradientDescentOp in there, which references a functor ApplyGradientDescent:
var.device(d) -= grad * lr();
The implementation of the Op itself follows the implementation of any other op as described in the adding-an-op docs.
Before running the TensorFlow session, one should instantiate an optimizer as seen below:
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.train.GradientDescentOptimizer is an object of the class GradientDescentOptimizer and as the name says, it implements the gradient descent algorithm.
The method minimize() is being called with a “cost” as parameter and consists of the two methods compute_gradients() and then apply_gradients().
For most (custom) optimizer implementations, the method apply_gradients() needs to be adapted.
This method relies on the (new) Optimizer (class), which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().
_create_slots() and _prepare() create and initialise additional
variables, such as momentum.
_apply_dense(), and _apply_sparse() implement the actual Ops, which update the variables.
Ops are generally written in C++. Without having to change the C++ header yourself, you can still return a Python wrapper of some ops through these methods.
This is done as follows:
def _create_slots(self, var_list):
    # Create slots for allocation and later management of additional
    # variables associated with the variables to train.
    # for example: the first and second moments.
    '''
    for v in var_list:
        self._zeros_slot(v, "m", self._name)
        self._zeros_slot(v, "v", self._name)
    '''

def _apply_dense(self, grad, var):
    # define your favourite variable update
    # for example:
    '''
    # Here we apply gradient descent by subtracting the variables
    # with the gradient times the learning_rate (defined in __init__)
    var_update = state_ops.assign_sub(var, self.learning_rate * grad)
    '''
    # The trick is now to pass the Ops in the control_flow_ops and
    # eventually group any particular computation of the slots you
    # wish to keep track of:
    # for example:
    '''
    m_t = ...m... # do something with m and grad
    v_t = ...v... # do something with v and grad
    '''
    return control_flow_ops.group(*[var_update, m_t, v_t])
For a more detailed explanation with an example, see this blog post:
https://www.bigdatarepublic.nl/custom-optimizer-in-tensorflow/
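To make the skeleton above concrete, here is a rough, untested sketch of a complete minimal optimizer (plain SGD with momentum) written against the TF1 tf.train.Optimizer base class. The class name and hyperparameters are illustrative and not taken from the blog post:

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops, state_ops

class SimpleMomentumOptimizer(tf.train.Optimizer):
    def __init__(self, learning_rate=0.01, momentum=0.9, use_locking=False, name="SimpleMomentum"):
        super(SimpleMomentumOptimizer, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._momentum = momentum

    def _prepare(self):
        # Convert the Python hyperparameters to tensors once per minimize() call.
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._momentum_t = ops.convert_to_tensor(self._momentum, name="momentum")

    def _create_slots(self, var_list):
        # One momentum accumulator slot ("m") per trainable variable.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)

    def _apply_dense(self, grad, var):
        lr_t = tf.cast(self._lr_t, var.dtype.base_dtype)
        momentum_t = tf.cast(self._momentum_t, var.dtype.base_dtype)
        m = self.get_slot(var, "m")
        m_t = state_ops.assign(m, momentum_t * m + grad, use_locking=self._use_locking)   # update buffer
        var_update = state_ops.assign_sub(var, lr_t * m_t, use_locking=self._use_locking) # apply step
        return control_flow_ops.group(*[var_update, m_t])

    def _apply_sparse(self, grad, var):
        raise NotImplementedError("Sparse gradients are not handled in this sketch.")

It would then be used like any built-in optimizer, e.g. train_op = SimpleMomentumOptimizer(learning_rate=0.01).minimize(cost).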