I am trying to modify the loss function of my convnet a bit, and I have some questions on the implementation side.
I already know how to create a custom loss function in Keras and how to call it, but it is still not clear to me where to include the derivative of the function.
Let's say that my new loss function is:
Loss = cross-entropy + f(x)
where f(x) = x**2.
Where should I include f'(x)=2x so that it is used in the back-prop step?
Does Keras automatically do that? Or should I define this explicitly in some part?
Thanks for any hint on this, since I do not know how to do it.
Chuan.
A loss must be a function of a) your network's output and b) the correct labels.
Having loss = a + b makes your network minimize both a and b.
Minimizing x**2 pushes x towards zero.
Minimizing softmax() does not make sense: softmax(x) is not a loss function; it is defined for a vector x and makes that vector sum to 1, so you can't really minimize it. I guess you are mixing concepts here.
Softmax is an activation function, and its output can be used to compute a loss, e.g. log loss.
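As a concrete illustration of the original question, here is a minimal sketch (assuming the x in f(x) refers to the network's predictions, which is my reading, not necessarily the asker's): you only write the forward expression of the loss with backend ops, and Keras/TensorFlow differentiates it automatically during back-prop, so there is no place where you would supply f'(x) = 2x yourself.

import tensorflow as tf
import tensorflow.keras.backend as K

# Hypothetical custom loss: cross-entropy plus f(x) = x**2 applied to the predictions.
# The derivative f'(x) = 2x is never written out: automatic differentiation handles it
# when the training step runs.
def custom_loss(y_true, y_pred):
    return K.categorical_crossentropy(y_true, y_pred) + K.mean(K.square(y_pred), axis=-1)

# Toy model just to show where the loss is plugged in (the architecture is made up).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss=custom_loss)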
I have two parameters which I want a neural network to predict. What is the best or most conventional method to implement the loss function? Currently I just define the loss, torch.nn.L1Loss(), which automatically computes the mean for both parameters such that it becomes a scalar.
Another plausible method would be to create two loss functions, one for each parameter, and successively backpropagate.
I don't really see whether both methods compute the same thing and whether one method is better (or plain wrong).
The problem could be seen as a multi-task problem; for example, the two parameters represent task A and task B respectively.
In multi-task learning, two loss functions are often used.
The usual form is as follows,
$$\text{total\_loss} = \alpha \cdot \text{A\_loss}(\hat{y}_1, y_1) + \beta \cdot \text{B\_loss}(\hat{y}_2, y_2)$$
Here $\alpha$ and $\beta$ are the weights of the two loss terms; usually they are both 1 or 0.5.
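To connect this back to the question, here is a minimal sketch (the architecture and tensor shapes are made up, and I use the L1 loss already mentioned): compute one loss per parameter, combine them with the weights, and call backward() once on the weighted sum. By linearity of the gradient, this gives the same gradients as backpropagating the two weighted losses one after the other.

import torch
import torch.nn as nn

# Toy network predicting two parameters (purely illustrative).
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(net.parameters())
l1 = nn.L1Loss()

alpha, beta = 1.0, 1.0          # task weights
x = torch.randn(16, 10)         # dummy batch
target = torch.randn(16, 2)     # dummy ground truth for both parameters

optimizer.zero_grad()
pred = net(x)

# One L1 loss per predicted parameter, combined into a single scalar.
loss_a = l1(pred[:, 0], target[:, 0])
loss_b = l1(pred[:, 1], target[:, 1])
total_loss = alpha * loss_a + beta * loss_b

# A single backward on the weighted sum accumulates the same gradients as
# calling backward() on each weighted term separately.
total_loss.backward()
optimizer.step()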
I have a neural network Network that has a vector output. Instead of using a typical loss function, I would like to implement my own loss function that is a method in some class. This looks something like:
class whatever:
    def __init__(self, network, optimizer):
        self.network = network
        self.optimizer = optimizer

    def cost_function(self, relevant_data):
        ...implementation of cost function with respect to output of network and relevant_data...

    def train(self, epochs, other_params):
        ...part I'm having trouble with...
The main thing I'm concerned about is taking gradients. Since I'm using my own custom loss function, do I need to implement my own gradient of the cost function?
Once I do the math, I realize that if the cost is J, then the gradient of J is a fairly simple function of the gradient of the final layer of the Network, i.e. it looks something like: Equation link.
If I used some traditional loss function like CrossEntropy, my backward pass would look like:

objective = nn.CrossEntropyLoss()

for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = objective(output, data)
    loss.backward()
    optimizer.step()
But how do we do this in my case? My guess is something like:
for epoch in range(epochs):
    optimizer.zero_grad()
    output = Network(input)
    loss = cost_function(output, data)
    # And here is where the problem comes in
    loss.backward()
    optimizer.step()
loss.backward(), as I understand it, takes the gradients of the loss function with respect to the parameters. But can I still invoke it while using my own loss function (presumably the program doesn't know what the gradient equation is)? Do I have to implement another method/subroutine to find the gradients as well?
Which brings me to my other question: if I do want to implement gradient calculation for my loss function, I also need the gradient of the neural network parameters. How do I obtain those? Is there a function for that?
As long as all your steps starting from the input till the loss function involve differentiable operations on PyTorch's tensors, you need not do anything extra. PyTorch builds a computational graph that keeps track of each operation, its inputs, and gradients. So, calling loss.backward() on your custom loss would still propagate gradients back correctly through the graph. A Gentle Introduction to torch.autograd from the PyTorch tutorials may be a useful reference.
After the backward pass, if you need to directly access the gradients for further processing, you can do so using the .grad attribute (so t.grad for tensor t in the graph).
Finally, if you have a specific use case for finding the gradient of an arbitrary differentiable function implemented using PyTorch's tensors with respect to one of its inputs (e.g. gradient of the loss with respect to a particular weight in the network), you could use torch.autograd.grad.
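Here is a minimal sketch of what that looks like in practice (the network, tensor shapes, and the particular cost expression are made up; only the autograd behaviour is the point): the custom loss is built from tensor operations, so backward() propagates through it with no extra work, and the gradients can then be read from .grad or obtained with torch.autograd.grad.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))  # toy network
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

def cost_function(output, data):
    # Any differentiable expression built from tensor ops works; autograd
    # records it in the graph, so no manual gradient formula is needed.
    return ((output - data) ** 2).mean() + 0.1 * output.abs().sum()

x = torch.randn(5, 4)
data = torch.randn(5, 3)

optimizer.zero_grad()
output = net(x)
loss = cost_function(output, data)
loss.backward()                   # gradients now populate p.grad for each parameter
optimizer.step()

first_weight = net[0].weight
print(first_weight.grad.shape)    # direct access to a gradient after backward()

# Alternatively, gradients of the loss w.r.t. chosen tensors without touching .grad:
grads = torch.autograd.grad(cost_function(net(x), data), net.parameters())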
I have asked a similar question before but got no response, so I am trying again.
I am reading a paper which suggests adding a value, calculated outside of TensorFlow, into the loss function of a neural network model in TensorFlow. I show the quote here (I have blurred the unimportant part):
How do I add a precalculated value to the loss function when fitting a Sequential model in TensorFlow?
The loss function used is BinaryCrossentropy; you can see it in equation (4) in the paper quote. The value to be added is also shown in the quote, but I don't think it matters for the question.
It also does not matter what my model looks like; I just want to add a constant value to my loss function in TensorFlow when fitting my model.
Thank you very much!!
As you can see in the equation above, there is a chance that the outcome becomes very low, i.e. the vanishing-gradient problem may occur.
In order to alleviate that, they suggest adding a constant value to the loss.
You can use a simple constant such as 1 or 10, or something proportional to what they describe.
You can easily calculate the expectation over the ground truth for one part. The other part is the tricky one, as you won't have those values until you train, and calculating them on the fly is not wise.
That term measures how much the predictions will differ from the ground truth.
So, if you are going to implement this paper, add a constant value of 1 to your loss so that it doesn't vanish.
It seems that you want to be able to define your own loss. Also, I am not sure whether you use actual Tensorflow or Keras. Here is a solution with Keras:
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential

def my_custom_loss(precomputed_value):
    def loss(y_true, y_pred):
        return K.binary_crossentropy(y_true, y_pred) + precomputed_value
    return loss

my_model = Sequential()
my_model.add(...)  # Add any layer there
my_model.compile(loss=my_custom_loss(42))
Inspired by https://towardsdatascience.com/advanced-keras-constructing-complex-custom-losses-and-metrics-c07ca130a618
EDIT: The answer was only for adding a constant term, but I realize that the term suggested in the paper is not constant.
I haven't read the paper, but I suppose from the cross-entropy definition that sigma is the ground truth and p is the predicted value. If there are no other dependencies, the solution can be even simpler:

def my_custom_loss(y_true, y_pred):
    norm_term = K.square(K.mean(y_true) - K.mean(y_pred))
    return K.binary_crossentropy(y_true, y_pred) + norm_term

# ...
my_model.compile(loss=my_custom_loss)
Here, I assumed the expectations are only computed on each batch. Tell me whether it is what you want. Otherwise, if you want to compute your statistics at a different scale, e.g. on the whole dataset after every epoch, you might need to use callbacks.
In that case, please give more precision on your problem, adding for instance a small example for y_pred and y_true, and the expected loss.
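For the whole-dataset case, here is a rough sketch of the callback idea (the variable name, the statistic, and the callback class are all made up; it only shows the mechanism): keep the precomputed term in a tf.Variable that the loss closure reads, and update it from a callback at the end of every epoch.

import tensorflow as tf
import tensorflow.keras.backend as K

norm_term = tf.Variable(0.0, trainable=False)  # hypothetical dataset-level statistic

def my_custom_loss(y_true, y_pred):
    # The loss reads whatever value the callback last stored in norm_term.
    return K.binary_crossentropy(y_true, y_pred) + norm_term

class UpdateNormTerm(tf.keras.callbacks.Callback):
    def __init__(self, x_all, y_all):
        super().__init__()
        self.x_all, self.y_all = x_all, y_all

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_all, verbose=0)
        # Example statistic: squared difference of the dataset-wide means.
        norm_term.assign((self.y_all.mean() - preds.mean()) ** 2)

# my_model.compile(loss=my_custom_loss)
# my_model.fit(x_train, y_train, callbacks=[UpdateNormTerm(x_train, y_train)])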
I am trying to understand how Keras actually computes the gradients of a custom loss in a general setting.
Normally, losses are defined as a sum of independent contributions over the samples. This eventually allows proper parallelisation in the computation of the gradients.
However, if I add a global non-linearity on top of it, thus coupling the contributions of the individual samples, is Keras able to handle the differentiation properly?
In practice, is it actually minimising f(sum_i(x_i)), or does it compute it one sample at a time, thus reducing to sum_i(f(x_i))?
Below an example in the case of a log function.
def custom_loss(y_true, y_pred):
    return K.log(1 + K.mean((y_pred - y_true) * (y_pred - y_true)))
I have checked for documentation but I couldn't find any precise answer.
It minimizes whatever you tell it to minimize.
If you want to minimize the log of the whole sum, then apply the log after the sum.
If you want to minimize the log of each sample and sum later, then apply the log before the sum:
def log_of_sum(y_true, y_pred):
    return K.log(1 + K.mean(K.square(y_true - y_pred)))

def sum_of_logs(y_true, y_pred):
    return K.mean(K.log(1 + K.square(y_true - y_pred)))
    # mean is optional here - you can return all the samples and Keras will handle it
    # returning all the samples allows other functions to work, like sample_weights
For a standard machine learning problem, e.g. image classification on MNIST, the loss function is fixed, therefore the optimization process can be accomplished simply by calling library functions and feeding the input into them. There is no need to derive gradients and code the descent procedure by hand.
But now I'm confused, having met a more complicated formulation. Say we are solving a semi-supervised problem, and the loss function has two parts: Ls + lambda * Lu. The first part is a normal classification formulation, e.g. cross-entropy loss. The second part varies; in my situation, Lu is a matrix factorization loss, which specifically is Lu = MF(D, C * W). The total loss function can be written as:
$$\begin{aligned} L &= \sum_i \log p(y_i \mid x_i) + MF(D, C \cdot W) \\ &= \sum_i \log p(y_i \mid W_i) + MF(D, C \cdot W) \\ &= \sum_i \log p(y_i \mid T \cdot W_i + b) + MF(D, C \cdot W) \end{aligned}$$
where the parameters are W, C, T and b. The first part is a classification loss, and the input x_i is a row of W, i.e. W_i, a vector of size (d, 1). The label y_i is a one-hot vector of size (c, 1), so the parameters T and b map the input to the label size. The second part is a matrix factorization loss.
Now I'm confused about how to optimize this function using SGD. It could be solved by writing down the formulation, deriving the gradients, and implementing a training procedure from scratch, but I'm wondering if there is a simpler way. It's easy to use a deep learning tool like TensorFlow or Keras to train a classification model: all you need to do is build a network and feed in the data.
So, similarly, is there a tool that can automatically compute the gradients once I have defined the loss function? Deriving gradients and implementing them from scratch is really tedious. Both the classification loss and the matrix factorization loss are very common, so I think the combination should be achievable too.
Theano and TensorFlow will do exactly this for you if you can formulate your optimization problem within their framework/language. These frameworks are also general enough to implement non-NN-based algorithms, such as simple first-order optimizations like yours.
If that's not possible, you can try autograd, which can do this on a subset of NumPy. Just formulate your loss as a NumPy function (while sticking to the supported operations; read the docs) and let autograd build the gradients.
Keep in mind that the graph-construction approach used by Theano & TensorFlow will be more efficient (because of the more constrained input and because these two libraries are a bit more mature).
Both Theano & TensorFlow have built-in differentiation, so you only need to form the loss.