I have asked a similar question before but got no response, so I am trying again.
I am reading a paper which suggests adding a value that is calculated outside of TensorFlow to the loss function of a neural network model in TensorFlow. I show the quote here (I have blurred the unimportant part):
How do I add a precalculated value to the loss function when fitting a sequential Model in Tensorflow?
The loss function used is BinaryCrossentropy; you can see it in equation (4) of the paper quote. The value to be added is also shown in the quote, but I don't think it is important for the question.
It also doesn't matter what my model looks like; I just want to add a constant value to my loss function in TensorFlow when fitting my model.
Thank you very much!!
As you can see in the equation above, there is a chance that the outcome becomes very low, i.e. the problem of vanishing gradients may occur.
In order to alleviate that, they suggest adding a constant value to the loss.
Now, you can add a simple constant such as 1 or 10, or something proportional to the term they describe.
You can easily calculate the expectation from the ground truth for one part. The other part is the tricky one, as you won't have values until you train, and calculating them on the fly is not wise.
That term measures how much difference there will be between the ground truth and the predictions.
So, if you are going to implement this paper, add a constant value of 1 to your loss so that it doesn't vanish.
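For illustration, here is a minimal sketch of that idea, assuming the extra term is a scalar you can compute from the training labels before fitting (the names make_offset_loss and offset are only illustrative, not from the paper):

import numpy as np
import tensorflow as tf

def make_offset_loss(offset):
    """Binary cross-entropy plus a precomputed scalar term."""
    bce = tf.keras.losses.BinaryCrossentropy()
    def loss(y_true, y_pred):
        return bce(y_true, y_pred) + offset
    return loss

# The offset can be a plain constant (e.g. 1.0) or something derived
# from the ground truth, computed once before training.
y_train = np.array([0., 1., 1., 0.])
offset = 1.0 + float(np.mean(y_train))

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss=make_offset_loss(offset))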
It seems that you want to be able to define your own loss. Also, I am not sure whether you use actual Tensorflow or Keras. Here is a solution with Keras:
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential

def my_custom_loss(precomputed_value):
    def loss(y_true, y_pred):
        return K.binary_crossentropy(y_true, y_pred) + precomputed_value
    return loss

my_model = Sequential()
my_model.add(...)  # add any layers here
my_model.compile(loss=my_custom_loss(42))
Inspired by https://towardsdatascience.com/advanced-keras-constructing-complex-custom-losses-and-metrics-c07ca130a618
EDIT: The answer was only for adding a constant term, but I realize that the term suggested in the paper is not constant.
I haven't read the paper, but I suppose from the cross-entropy definition that sigma is the ground truth and p is the predicted value. If there are no other dependencies, the solution can be even simpler:
def my_custom_loss(y_true, y_pred):
    norm_term = K.square(K.mean(y_true) - K.mean(y_pred))
    return K.binary_crossentropy(y_true, y_pred) + norm_term
# ...
my_model.compile(loss=my_custom_loss)
Here, I assumed the expectations are only computed on each batch. Tell me whether that is what you want. Otherwise, if you want to compute your statistics at a different scale, e.g. on the whole dataset after every epoch, you might need to use callbacks; a possible sketch is shown below.
In that case, please give more details about your problem, adding for instance a small example of y_pred and y_true and the expected loss.
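For the epoch-level case, one possible pattern (just a sketch, assuming a squared difference of means is the statistic you want; UpdateNormTerm, x_all and y_all are hypothetical names) is to keep the statistic in a non-trainable tf.Variable that the loss reads and a callback updates after each epoch:

import tensorflow as tf
import tensorflow.keras.backend as K

# Variable holding the dataset-level statistic; the loss reads it on every batch.
norm_term = tf.Variable(0.0, trainable=False, dtype=tf.float32)

def my_custom_loss(y_true, y_pred):
    return K.binary_crossentropy(y_true, y_pred) + norm_term

class UpdateNormTerm(tf.keras.callbacks.Callback):
    """Recompute the statistic on the whole training set after every epoch."""
    def __init__(self, x_all, y_all):
        super().__init__()
        self.x_all, self.y_all = x_all, y_all

    def on_epoch_end(self, epoch, logs=None):
        y_pred = self.model.predict(self.x_all, verbose=0)
        norm_term.assign(float((self.y_all.mean() - y_pred.mean()) ** 2))

# my_model.compile(loss=my_custom_loss)
# my_model.fit(x_train, y_train, callbacks=[UpdateNormTerm(x_train, y_train)])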
Suppose you have a neural network with 2 layers, A and B. A gets the network input. A and B are consecutive (A's output is fed into B as input). Both A and B output predictions (prediction1 and prediction2); see the picture of the described architecture.
You calculate a loss (loss1) directly after the first layer (A) with a target (target1). You also calculate a loss after the second layer (loss2) with its own target (target2).
Does it make sense to use the sum of loss1 and loss2 as the error function and back propagate this loss through the entire network? If so, why is it "allowed" to back propagate loss1 through B even though it has nothing to do with it?
This question is related to this question: https://datascience.stackexchange.com/questions/37022/intuition-importance-of-intermediate-supervision-in-deep-learning, but it does not answer my question sufficiently.
In my case, A and B are unrelated modules. In the aforementioned question, A and B would be identical. The targets would be the same, too.
(Additional information)
The reason why I'm asking is that I'm trying to understand LCNN (https://github.com/zhou13/lcnn) from this paper.
LCNN is made up of an Hourglass backbone, which is fed into a MultiTask Learner (which creates loss1), which in turn is fed into a LineVectorizer module (loss2). Both loss1 and loss2 are then summed up here and then backpropagated through the entire network here.
Even though I've attended several deep learning lectures, I didn't know this was "allowed" or made sense to do. I would have expected two loss.backward() calls, one for each loss. Or is the PyTorch computational graph doing something magical here? LCNN converges and outperforms other neural networks that try to solve the same task.
Yes, it is "allowed" and it also makes sense.
From the question, I believe you have understood most of it, so I'm not going into detail about why this multi-loss architecture can be useful. I think the main part that has confused you is: why does "loss1" back-propagate through "B"? And the answer is: it doesn't. The fact is that loss1 is calculated using this formula:
loss1 = SOME_FUNCTION(label, y_hat)
and y_hat (prediction1) depends only on the layers before it. Hence, the gradient of this loss only flows through the layers before this section (A) and not the ones after it (B). To better understand this, you could again check the mathematics of artificial neural networks. loss2, on the other hand, back-propagates through all of the network (including part A). When you use a cumulative loss (loss = loss1 + loss2), a framework like PyTorch will automatically follow the gradient of every predicted label back to the first layer.
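As a minimal PyTorch sketch of that cumulative-loss pattern (the modules, shapes and MSE criterion below are illustrative stand-ins, not taken from LCNN):

import torch
import torch.nn as nn

A = nn.Linear(10, 5)     # stands in for the first module
B = nn.Linear(5, 3)      # stands in for the second module
criterion = nn.MSELoss()

x = torch.randn(4, 10)
target1 = torch.randn(4, 5)
target2 = torch.randn(4, 3)

prediction1 = A(x)
prediction2 = B(prediction1)

loss1 = criterion(prediction1, target1)   # its gradient only reaches A's parameters
loss2 = criterion(prediction2, target2)   # its gradient reaches both A and B

loss = loss1 + loss2
loss.backward()   # one backward pass; autograd routes each term appropriately

Calling backward() once on the sum gives the same gradients as calling loss1.backward(retain_graph=True) followed by loss2.backward(); the per-term gradients simply accumulate.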
I am trying to understand how Keras actually computes the gradients of a custom loss in a general setting.
Normally losses are defined as a sum over independent contributions from the samples. This eventually allows proper parallelisation in the computation of the gradients.
However, if I add a global non-linearity on top of it, thus coupling the contributions of the individual samples, is Keras able to handle the differentiation properly?
In practice, is it actually minimising f(sum_i(x_i)), or does it compute one sample at a time and thus reduce to sum_i(f(x_i))?
Below is an example in the case of a log function.
import tensorflow.keras.backend as K

def custom_loss(y_true, y_pred):
    return K.log(1 + K.mean((y_pred - y_true) * (y_pred - y_true)))
I have checked the documentation but couldn't find a precise answer.
It minimizes whatever you tell it to minimize.
If you want to minimize the log of the whole sum, then apply the log after the sum.
If you want to minimize the log of each sample and sum later, then apply the log before the sum
def log_of_sum(y_true, y_pred):
    return K.log(1 + K.mean(K.square(y_true - y_pred)))

def sum_of_logs(y_true, y_pred):
    # The mean is optional here - you can return the per-sample values and Keras will handle it.
    # Returning per-sample values also lets other features work, like sample_weights.
    return K.mean(K.log(1 + K.square(y_true - y_pred)))
I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activations, switching from gradient descent to Adam, and adding L2 regularization... the model predicts the same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
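For illustration, a minimal Keras sketch of such an unbounded regression head (the layer sizes and input dimension are arbitrary placeholders):

import tensorflow as tf

n_features = 8   # placeholder: replace with your actual input dimension

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(64, activation='relu'),
    # Linear output: with relu hidden units the prediction is not squashed
    # into a fixed [bias - sum|kernel|, bias + sum|kernel|] interval.
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')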
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task, so there should be no problem with the method itself, only with the execution. @DomJack explained how the output is restricted for a specific set of parameters, but that only happens with anomalous data. In general, during training the parameters will be tuned so that the model predicts the output correctly.
So the first point is about the quality of the training data. Make sure you have enough training data (and that it is split randomly if you split train/test from one dataset). Also, maybe trivially, make sure you didn't mess up the input/output values in preprocessing.
Another point is the size of the network. Make sure you use large enough hidden layers.
I have a CNN architecture which consists of some layers -- convolution, fully-connected, and deconvolution -- (call it the first process). The last deconvolution layer gives me points as the output, and I need to do some processing (call it the second process) with this output to get the loss value.
In the second process, I'm using tf.while_loop to calculate the loss value, because the total loss is obtained by adding up the loss values from each iteration of tf.while_loop. I initialize the loss with tf.constant(0) before looping.
When I try to train and minimize that loss, I get a "No gradients provided" error between the output of the first process and the loss tensor.
The second process looks like this:
loss = tf.constant(0)
i = tf.constant(0)

def cond(i, loss):
    return tf.less(i, tf.size(xy))

def body(i, loss):
    # xy is the output from the first process
    xy = tf.cast(xy, tf.float32)
    x = tf.reduce_mean(xy)
    loss = tf.add(loss, x)
    return [tf.add(i, 1), loss]

r = tf.while_loop(cond, body, [i, loss])
optimizer.minimize(r[1])
I also do some processing inside the second process which (as I read in many posts, especially here) does not provide gradients.
Any help would be really appreciated.
There are several reasons why you might get that error. Without seeing your original code it might be hard to debug, but here are at least two reasons why gradients are not provided:
There are some TensorFlow operations through which gradients cannot flow, i.e. through which back-propagation cannot occur, for example tf.cast or tf.assign. In the post that you linked, there is a comment that mentions this. So in the example you provided, tf.cast will definitely cause an issue.
A solution to this problem would be to restructure your code in such a way that you don't use TensorFlow operations that prevent gradients from passing through them.
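For instance, here is a sketch of how the loop from the question could be rewritten so everything in the gradient path stays differentiable, under the assumption that xy already comes out of the deconvolution layer as float32 (so the tf.cast is unnecessary):

import tensorflow as tf

def compute_loss(xy):
    # Use a float accumulator so it matches the dtype of the values being added.
    loss = tf.constant(0.0, dtype=tf.float32)
    i = tf.constant(0)

    def cond(i, loss):
        return tf.less(i, tf.size(xy))

    def body(i, loss):
        # No tf.cast here: xy is assumed to already be float32,
        # so gradients can flow back through reduce_mean and add.
        loss = tf.add(loss, tf.reduce_mean(xy))
        return [tf.add(i, 1), loss]

    _, loss = tf.while_loop(cond, body, [i, loss])
    return loss

# total_loss = compute_loss(xy)
# optimizer.minimize(total_loss)   # TF1-style, matching the question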
A second reason why this might occur is when you try to optimize variables using a loss that was not calculated on those variables. For example, this happens if you calculated the loss in your first process on conv1 variables, but in your second process you try to update/optimize conv2 variables. This will not work, since the gradients will be calculated for the conv1 variables and not for conv2.
It looks like in your case it is most likely the first issue and not the second one.
I am using the TensorFlow DNNRegressor Estimator model to build a neural network. But calling the estimator.train() function gives output as follows:
That is, my loss value varies a lot with every step. But as far as I know, the loss should decrease with the number of iterations. Also, see the attached screenshot of the TensorBoard visualisation of the loss:
The doubts I'm not able to figure out are:
Is it the overall loss value (combined loss for every step processed so far) or just that step's loss value?
If it is that step's loss value, then how do I get the value of the overall loss and see its trend, which I feel should decrease with an increasing number of iterations? In my understanding, that is the value we should look at while training on a dataset.
If it is the overall loss value, then why is it fluctuating so much? Am I missing something?
First of all, let me point out that tf.contrib.learn.DNNRegressor uses a linear regression head with mean_squared_loss, i.e. simple L2 loss.
Is it the overall loss value (combined loss for every step processed so far) or just that step's loss value?
Each point on the chart is the value of the loss function at that particular step, i.e. computed on the batch processed at that step.
If it is that step's loss value, then how do I get the value of the overall loss and see its trend, which I feel should decrease with an increasing number of iterations?
There is no overall loss function; you probably mean a chart of how the loss changed after each step. That's exactly what TensorBoard is showing you. You are right that its trend is not downwards, as it should be. This indicates that your neural network is not learning.
If it is the overall loss value, then why is it fluctuating so much? Am I missing something?
A common reason for the neural network not learning is poor choice of hyperparameters (though there are many more mistakes you can possibly make). For example:
the learning rate is too large
it's also possible that the learning rate is too small, which means that the neural network is learning, but very very slowly, so that you can't see it
the weight initialization scale is probably too large; try decreasing it
batch size may be too large as well
you're passing wrong labels for the inputs
the training data contains missing values, or is not normalized
...
What I usually do to check whether a neural network is at least somewhat working is to reduce the training set to a few examples and try to overfit the network. This experiment is very fast, so I can try various learning rates, initialization variances and other parameters to find a sweet spot. Once I have a steadily decreasing loss chart, I go on with a bigger set.
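As a rough illustration of that sanity check, shown here with a plain Keras model rather than the Estimator (x_train and y_train stand in for your real training arrays):

import tensorflow as tf

# Take a tiny slice of the training data and try to drive the loss close to zero.
x_small, y_small = x_train[:16], y_train[:16]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(x_small.shape[1],)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')

history = model.fit(x_small, y_small, epochs=500, verbose=0)
print(history.history['loss'][-1])   # should be near zero if the setup is sane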
Though the previous answer is very informative and good, it doesn't quite address your issue. When you instantiate DNNRegressor, add:
loss_reduction=tf.losses.Reduction.MEAN
to the constructor, and you'll see your average loss converge.
estimator = tf.estimator.DNNRegressor(
    feature_columns=feat_clmns,
    hidden_units=[32, 64, 32],
    weight_column=weight_clmn,
    loss_reduction=tf.losses.Reduction.MEAN)