I have been going through all of the loss functions in PyTorch and rebuilding them from scratch to gain a better understanding of them, and I've run into what is either an issue with my recreation or an issue with PyTorch's implementation.
PyTorch's documentation for SmoothL1Loss states that if the absolute value of the prediction minus the ground truth is less than beta, we use the top equation; otherwise, we use the bottom one. Please see the documentation for the equations.
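For reference (not in the original post, which only points to the documentation), the piecewise definition there is:

$$
\ell_n = \begin{cases} 0.5\,(x_n - y_n)^2 / \beta, & \text{if } |x_n - y_n| < \beta \\ |x_n - y_n| - 0.5\,\beta, & \text{otherwise.} \end{cases}
$$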
Below is my implementation of this in the form of a minimum test:
import torch
import torch.nn as nn
import numpy as np
predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
def l1_loss_smooth(predictions, targets, beta=1.0):
    loss = 0
    for x, y in zip(predictions, targets):
        if abs(x - y).mean() < beta:
            loss += (0.5 * (x - y)**2 / beta).mean()
        else:
            loss += (abs(x - y) - 0.5 * beta).mean()
    loss = loss / predictions.shape[0]
    return loss

output = l1_loss_smooth(predictions, target)
print(output)
Gives an output of:
tensor(0.7475, grad_fn=<DivBackward0>)
Now the PyTorch implementation:
loss = nn.SmoothL1Loss(beta=1.0)
output = loss(predictions, target)
Gives an output of:
tensor(0.7603, grad_fn=<SmoothL1LossBackward>)
I can’t figure out where the error in implementation lies.
Upon looking a little deeper into the smooth_l1_loss function in the _C module (file: smooth_c_loss_op.cc), I noticed that the docstring mentions that it's a variation on Huber loss, but the documentation for SmoothL1Loss says it is Huber loss.
So overall, I'm just confused about how it's implemented: is it a combination of smooth L1 loss and Huber loss, just Huber loss, or something else?
The description in the documentation is correct. Your implementation wrongly applies the case selection on the mean of the data. It should be an element-wise selection instead (if you think about the implementation of the vanilla L1 loss, and the motivation for smooth L1 loss).
The following code gives a consistent result:
import torch
import torch.nn as nn
import numpy as np
predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
def l1_loss_smooth(predictions, targets, beta=1.0):
    loss = 0
    diff = predictions - targets
    mask = (diff.abs() < beta)
    loss += mask * (0.5 * diff**2 / beta)
    loss += (~mask) * (diff.abs() - 0.5 * beta)
    return loss.mean()
output = l1_loss_smooth(predictions, target)
print(output)
loss = nn.SmoothL1Loss(beta=1.0)
output = loss(predictions, target)
print(output)
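For reference, the same element-wise selection can also be written compactly with torch.where; this is a sketch of my own, not part of the original answer:

import torch

def l1_loss_smooth_where(predictions, targets, beta=1.0):
    # Element-wise smooth L1: quadratic below beta, linear above, then averaged
    diff = (predictions - targets).abs()
    loss = torch.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()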
I have a simple NN:
import torch
import torch.nn as nn
import torch.optim as optim
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(1, 5)
        self.fc2 = nn.Linear(5, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net = Model()
opt = optim.Adam(net.parameters())
I also have some input features:
features = torch.rand((3,1))
I can train it normally with a simple loss function that will be minimized:
for i in range(10):
    opt.zero_grad()
    out = net(features)
    loss = torch.mean(torch.square(torch.tensor(5) - torch.sum(out)))
    print('loss:', loss)
    loss.backward()
    opt.step()
However, if I add another loss component that I want to maximize, loss2:
loss2s = []
for i in range(10000):
    opt.zero_grad()
    out = net(features)
    loss1 = torch.mean(torch.square(torch.tensor(5) - torch.sum(out)))
    loss2 = torch.sum(torch.tensor([torch.sum(w_arr) for w_arr in net.parameters()]))
    loss2s.append(loss2)
    loss = loss1 + loss2
    loss.backward()
    opt.step()
Training becomes seemingly unstable, since the two losses have different scales. Also, I'm not sure this is the correct way, because how would the loss know to maximize one part and minimize the other? Note that this is just an example; obviously there's no point in increasing the weights.
import matplotlib.pyplot as plt
plt.plot(loss2s, c='r')
plt.plot(loss1s, c='b')
Also, I believe that minimizing functions is the common way to train in ML, so I wasn't sure whether converting the maximization problem into a minimization problem in some way would be better.
The standard way to express "minimize" versus "maximize" is to change the sign. PyTorch always minimizes a loss when you call
loss.backward()
So, if another loss2 needs to be maximized, we add the negative of it:
overall_loss = loss + (- loss2)
overall_loss.backward()
since minimizing a negative quantity is equivalent to maximizing the original positive quantity.
With regard to "scale": yes, scales do matter. Often the following is done in order to match them:
overall_loss = loss + alpha * (- loss2)
where alpha is a fraction denoting the relative importance of one loss w.r.t. the other. It's a hyperparameter and needs to be experimented with.
Technicalities aside, whether the resulting loss will be stable depends a lot on the specific problem and the loss functions involved. If the losses are contradictory, you may experience instability. The ways to deal with them are a research problem in themselves and well beyond the scope of this question.
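Putting this together with the loop from the question, a minimal sketch might look like the following (it reuses net, opt, and features from above; alpha is a hypothetical value that needs tuning, and torch.stack is used so that the parameter sums stay inside the autograd graph):

alpha = 0.01  # hypothetical relative-importance weight

for i in range(10000):
    opt.zero_grad()
    out = net(features)
    loss1 = torch.mean(torch.square(torch.tensor(5.0) - torch.sum(out)))
    # torch.stack keeps the parameter sums connected to the graph,
    # unlike wrapping them in torch.tensor(...)
    loss2 = torch.stack([w.sum() for w in net.parameters()]).sum()
    overall_loss = loss1 + alpha * (-loss2)  # minimize loss1, maximize loss2
    overall_loss.backward()
    opt.step()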
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss that depends both on the output and on the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within PyTorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting for my neural network):
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are beside the point.
def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)
optimizer = Adam([theta], lr=0.1)

for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and it now looks to the optimizer as if the loss is completely independent of theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice; it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1
def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)
data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
# Full batched Jacobian of the network output w.r.t. the input, kept in the graph
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
# Keep only d xhat[i] / d t[i] for each sample i (the Jacobian is block-diagonal across the batch)
grad = jac[torch.arange(n), :, torch.arange(n), 0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
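For completeness, here is a sketch of my own (not part of the original answer) applying the same idea directly to the polynomial example from the question. The derivative is computed with create_graph=True and never re-wrapped in torch.tensor(...), which is what detached it from theta in the failing code:

import torch

theta = torch.ones(3, requires_grad=True)

def trainable_function(t):
    return theta[0] * t**3 + theta[1] * t**2 + theta[2] * t

data_times = torch.linspace(0, 1, 500)
# True derivative of t^3 + t^2 + t is 3t^2 + 2t + 1; add noise as in the question
noisy_targets = 3 * data_times**2 + 2 * data_times + 1 + 0.1 * torch.randn_like(data_times)

optimizer = torch.optim.Adam([theta], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    t = data_times.clone().requires_grad_(True)
    values = trainable_function(t)
    # Each output depends only on its own t_i, so the gradient of the sum gives the
    # per-sample derivatives; create_graph=True keeps theta inside the graph
    derivs, = torch.autograd.grad(values.sum(), t, create_graph=True)
    loss = ((derivs - noisy_targets) ** 2).sum()
    loss.backward()
    optimizer.step()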
I am trying to build a neural network in TensorFlow where a Type I error (false positive) is more costly than a Type II error (false negative). Is there a way to impose this during the training process (i.e. by inputting a cost matrix)? This is possible with simple models like logistic regression in scikit-learn by specifying the class_weight parameter.
from sklearn.linear_model import LogisticRegression

cw = {0: 3, 1: 1}
clf = LogisticRegression(class_weight=cw)
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be done the same way with a neural network, so I want to see if it is possible in TensorFlow.
Thanks
You could use tf.nn.weighted_cross_entropy_with_logits and its pos_weight argument.
This argument weights the positive class, as described by the documentation (in TF 2.0 at least):
A value pos_weight > 1 decreases the false negative count, hence increasing the recall.
Conversely, setting pos_weight < 1 decreases the false positive count and increases the precision.
In your case, you could create a custom loss function like this:
import tensorflow as tf

# Expects raw logits from your network, not the values after a sigmoid activation
class WeightedBinaryCrossEntropy:
    def __init__(self, positive_weight: float):
        self.positive_weight = positive_weight

    def __call__(self, targets, logits, sample_weight=None):
        return tf.nn.weighted_cross_entropy_with_logits(
            targets, logits, pos_weight=self.positive_weight
        )
And create a custom neural network with it, for example using tf.keras (samples are weighted as in your question):
import numpy as np
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(32, input_shape=(10,)),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(10),
        tf.keras.layers.Activation("relu"),
        # Output one logit for binary classification
        tf.keras.layers.Dense(1),
    ]
)
# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)
# 3 times as costly to make type I error
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=3))
model.fit(data, targets, batch_size=32)
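As a possible alternative (not part of the original answer, so treat it as a sketch): tf.keras's fit() also accepts a class_weight dict, much like scikit-learn's class_weight parameter, which scales each class's contribution to the loss:

# Reuses model, data and targets from above
model.compile(
    optimizer="rmsprop",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)
# Misclassifying a true 0 costs 3x as much as misclassifying a true 1
model.fit(data, targets, batch_size=32, class_weight={0: 3.0, 1: 1.0})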
You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1 and the loss evaluates to 1.71; for a 1 predicted as 0, y - ŷ = 1 and the loss evaluates to 0.63; for y == ŷ it equals 0. So a 0 incorrectly predicted as 1 is almost three times as costly.
import numpy as np
from math import exp

loss = abs(1 - exp(-np.log(exp(y - ŷ))))   # y: true label, ŷ: prediction

# abs(1 - exp(-np.log(exp(0))))   -> 0.0
# abs(1 - exp(-np.log(exp(-1))))  -> 1.718281828459045
# abs(1 - exp(-np.log(exp(1))))   -> 0.6321205588285577
Then you will have a convex optimization. Implementing:
import keras.backend as K

def custom_loss(y_true, y_pred):
    # Backend ops so the expression works on tensors rather than Python floats
    return K.mean(K.abs(1 - K.exp(-K.log(K.exp(y_true - y_pred)))))
Then:
model.compile(loss=custom_loss, optimizer=sgd, metrics=['accuracy'])
I have implemented a gradient boosting decision tree to do multiclass classification. My custom loss functions look like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def softmax(mat):
    res = np.exp(mat)
    res = np.multiply(res, 1/np.sum(res, axis=1, keepdims=True))
    return res

def custom_asymmetric_objective(y_true, y_pred_encoded):
    pred = y_pred_encoded.reshape((-1, 3), order='F')
    pred = softmax(pred)
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1))
    grad = (pred - y_true).astype("float")
    hess = 2.0 * pred * (1.0 - pred)
    return grad.flatten('F'), hess.flatten('F')

def custom_asymmetric_valid(y_true, y_pred_encoded):
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1)).flatten('F')
    margin = (y_true - y_pred_encoded).astype("float")
    loss = margin * 10
    return "custom_asymmetric_eval", np.mean(loss), False
Everything works, but now I want to adjust my loss function in the following way: it should "penalize" if an item is classified incorrectly, and a penalty should be added for a certain constraint (this is calculated beforehand; let's just say the penalty is e.g. 0.05, so just a real number).
Is there any way to consider both the misclassification and the penalty value?
Try L2 regularization: add a penalty term λ·w² to the loss, so each weight is updated by subtracting the learning rate times the error times x plus the gradient of the penalty term, i.e.

w ← w − η · (error · x + 2λw), where η is the learning rate and λ controls the penalty strength.

The penalization term (the λ term in the update above) increases the generalization power of your model. If you overfit your model on the training set, performance will be poor on the test set; the penalty discourages those "right" classifications in the training set that generate errors in the test set and compromise generalization.
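To make that concrete in the setting of the question, here is a sketch of my own (under the assumption that the penalty is a differentiable L2 term lam * pred**2 on the raw scores, since a boosted ensemble has no weight vector in the neural-network sense; lam is a hypothetical value): the penalty simply adds its first and second derivatives to the gradient and Hessian returned by the custom objective.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

def softmax(mat):
    res = np.exp(mat)
    return res / np.sum(res, axis=1, keepdims=True)

def custom_objective_with_penalty(y_true, y_pred_encoded, lam=0.05):
    pred_raw = y_pred_encoded.reshape((-1, 3), order='F')
    pred = softmax(pred_raw)
    y_true = OneHotEncoder(sparse=False, categories='auto').fit_transform(y_true.reshape(-1, 1))
    # Gradient/Hessian of the original objective plus those of lam * pred_raw**2
    grad = (pred - y_true).astype("float") + 2.0 * lam * pred_raw
    hess = 2.0 * pred * (1.0 - pred) + 2.0 * lam
    return grad.flatten('F'), hess.flatten('F')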
I was reading up on log-loss and cross-entropy, and it seems like there are 2 approaches for calculating it, based on the following equations.
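The equations themselves are not reproduced here; judging from the rest of the question and the answer below, the equation referred to chains the general multi-class cross-entropy with its binary special case, roughly

$$
\text{log loss} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{i,c}\,\log p_{i,c} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i\log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\,\bigr]
$$

where, as the answer below points out, the second equality only holds for binary classification.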
The first one is the following.
import numpy as np
from sklearn.metrics import log_loss
def cross_entropy(predictions, targets):
    N = predictions.shape[0]
    ce = -np.sum(targets * np.log(predictions)) / N
    return ce

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.97]])
targets = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1]])
x = cross_entropy(predictions, targets)
print(log_loss(targets, predictions), 'our_answer:', ans)
The output of the previous program is 0.7083767843022996 our_answer: 0.71355817782, which is almost the same. So that's not the issue.
The above implementation computes the middle part of the equation.
The second approach is based on the RHS of the same equation.
res = 0
for act_row, pred_row in zip(targets, np.array(predictions)):
    for class_act, class_pred in zip(act_row, pred_row):
        res += - class_act * np.log(class_pred) - (1-class_act) * np.log(1-class_pred)

print(res/len(targets))
And the output is 1.1549753967602232, which is not quite the same.
I have tried the same implementation with NumPy, but it also didn't work. What am I doing wrong?
PS: I am also curious: -y log(y_hat) seems to me to be the same as -Σ p_i log(q_i), so how come there is a -(1-y) log(1-y_hat) part? Clearly I am misunderstanding how -y log(y_hat) is to be calculated.
I cannot reproduce the difference in the results you report in the first part (you also refer to an ans variable, which you do not seem to define, I guess it is x):
import numpy as np
from sklearn.metrics import log_loss
def cross_entropy(predictions, targets):
    N = predictions.shape[0]
    ce = -np.sum(targets * np.log(predictions)) / N
    return ce

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.97]])
targets = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1]])
The results:
cross_entropy(predictions, targets)
# 0.7083767843022996
log_loss(targets, predictions)
# 0.7083767843022996
log_loss(targets, predictions) == cross_entropy(predictions, targets)
# True
Your cross_entropy function seems to work fine.
Regarding the second part:
Clearly I am misunderstanding how -y log (y_hat) is to be calculated.
Indeed, reading the fast.ai wiki you have linked to more carefully, you'll see that the RHS of the equation holds only for binary classification (where one of y and 1-y will always be zero), which is not the case here: you have a 4-class multinomial classification. So, the correct formulation is
res = 0
for act_row, pred_row in zip(targets, np.array(predictions)):
    for class_act, class_pred in zip(act_row, pred_row):
        res += - class_act * np.log(class_pred)
i.e. discarding the subtraction of (1-class_act) * np.log(1-class_pred).
Result:
res/len(targets)
# 0.7083767843022996
res/len(targets) == log_loss(targets, predictions)
# True
On a more general level (the mechanics of log loss & accuracy for binary classification), you may find this answer useful.