I want to compute the sum of the cross entropy over all classes for each prediction, where the input is a batch (size n) and the output is a batch (size n).
The simplest way is a for loop (over 1000 classes):
import torch

def sum_of_CE_loss(input):
    loss_fn = torch.nn.CrossEntropyLoss()
    L = 0
    for c in range(1000):
        # every sample in the batch labelled with class c
        target = torch.full((input.shape[0],), c, dtype=torch.long)
        L = L + loss_fn(input, target)
    return L
However, it is very slow. What is a better way? How can we parallelize it on the GPU (CUDA)?
First of all, to make it faster, you need to vectorize it, that is, work with matrices.
So, imagine you have 1,000 samples to compute the loss on. Also, your classification problem has 5 labels. To compute the CrossEntropyLoss we need an input and a target. Let's simulate that as follows:
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()  # the loss function
input = torch.randn(1000, 5)  # 1000 samples and 5 labels' predictions
target = torch.empty(1000, dtype=torch.long).random_(5)  # 1000 samples with labels from 0 to 4
loss_value = loss(input, target)  # outputs the loss
There we go! Now the loss is computed over all 1,000 samples at once. This is the fastest way to do it.
I found the answer:
-torch.nn.functional.log_softmax(input, dim=1).sum() / input.shape[0]
We divide by input.shape[0] because cross_entropy() takes, by default, the mean across the batch dimension.
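As a sanity check, here is a minimal sketch (with small, made-up sizes) comparing the slow per-class loop with the vectorized formula; the two results should match:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, num_classes = 4, 10  # small sizes just for the check
logits = torch.randn(n, num_classes)

# loop: cross entropy with every class c as the target, averaged over the batch each time
loop_sum = sum(
    F.cross_entropy(logits, torch.full((n,), c, dtype=torch.long))
    for c in range(num_classes)
)

# vectorized equivalent
vec_sum = -F.log_softmax(logits, dim=1).sum() / n

print(torch.allclose(loop_sum, vec_sum))  # True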
I have an output tensor (both target and predicted) of dimension (32 x 8 x 5000). Here, the batch size is 32, the number of classes is 5000 and the number of points per batch is 8. I want to calculate CELoss on this in such a way that the loss is computed for every point (across 5000 classes) and then averaged across the 8 points. How can I do this?
For clarity, there are 32 batch points in a batch (for bs=32). Each batch point has 8 vector points, and each vector point has 5000 classes. For a given batch, I wish to compute CELoss across all (8) vector points, compute their average and do so for all the batch points (32).
Let me know if my question isn’t clear or ambiguous.
For example:
op = torch.rand((4,3,5))
gt = torch.tensor([
[[0,1,1,0,0],[0,0,1,0,0],[1,1,0,0,1]],
[[1,1,0,0,1],[0,0,0,1,0],[0,0,1,0,0]],
[[0,0,1,0,0],[1,1,1,1,0],[1,1,0,0,1]],
[[1,1,0,0,1],[1,1,0,0,1],[1,0,0,0,0]]
])
DATA
op = torch.rand((4,3,5))
gt = torch.tensor([
[[0,1,1,0,0],[0,0,1,0,0],[1,1,0,0,1]],
[[1,1,0,0,1],[0,0,0,1,0],[0,0,1,0,0]],
[[0,0,1,0,0],[1,1,1,1,0],[1,1,0,0,1]],
[[1,1,0,0,1],[1,1,0,0,1],[1,0,0,0,0]]
], dtype=torch.float)
Now, if your output is in [0, 1] (if it is not, add a Sigmoid activation at the end of your model), you can compute the binary cross-entropy losses (N_class values for each point of each element) this way:
torch.nn.BCELoss(reduction="none")(op, gt)
You can finally compute the average loss for each element of the batch as:
torch.nn.BCELoss(reduction="none")(op, gt).mean(dim=[-1,-2])
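If instead your model outputs raw, unbounded scores, a common alternative to the Sigmoid + BCELoss combination above (not what this answer uses, just a sketch with made-up data) is BCEWithLogitsLoss, which fuses the Sigmoid with the loss and is more numerically stable:

import torch

op_logits = torch.randn((4, 3, 5))            # raw scores, no Sigmoid applied
gt = (torch.rand((4, 3, 5)) > 0.5).float()    # random multi-hot targets

per_element = torch.nn.BCEWithLogitsLoss(reduction="none")(op_logits, gt)
per_batch_element = per_element.mean(dim=[-1, -2])  # average over classes and points
print(per_batch_element.shape)                      # torch.Size([4])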
If it is not the solution you are looking for or it is not clear let me know.
I'm trying to get a better understanding of how Gradient Accumulation works and why it is useful. To this end, I wanted to ask what is the difference (if any) between these two possible PyTorch-like implementations of a custom training loop with gradient accumulation:
gradient_accumulation_steps = 5

for batch_idx, batch in enumerate(dataset):
    x_batch, y_true_batch = batch
    y_pred_batch = model(x_batch)
    loss = loss_fn(y_true_batch, y_pred_batch)
    loss.backward()
    if (batch_idx + 1) % gradient_accumulation_steps == 0:  # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
        optimizer.step()
        optimizer.zero_grad()
y_true_batches, y_pred_batches = [], []
gradient_accumulation_steps = 5

for batch_idx, batch in enumerate(dataset):
    x_batch, y_true_batch = batch
    y_pred_batch = model(x_batch)
    y_true_batches.append(y_true_batch)
    y_pred_batches.append(y_pred_batch)
    if (batch_idx + 1) % gradient_accumulation_steps == 0:  # (assumption: the number of batches is a multiple of gradient_accumulation_steps)
        y_true = stack_vertically(y_true_batches)
        y_pred = stack_vertically(y_pred_batches)
        loss = loss_fn(y_true, y_pred)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        y_true_batches.clear()
        y_pred_batches.clear()
Also, kind of as an unrelated question: Since the purpose of gradient accumulation is to mimic a larger batch size in cases where you have memory constraints, does it mean that I should also increase the learning rate proportionally?
1. The difference between the two programs:
Conceptually, your two implementations are the same: you forward gradient_accumulation_steps batches for each weight update.
As you already observed, the second method requires more memory resources than the first one.
There is, however, a slight difference: usually, loss function implementations use the mean to reduce the loss over the batch. When you use gradient accumulation (first implementation) you reduce using the mean over each mini-batch, but using the sum over the accumulated gradient_accumulation_steps mini-batches. To make sure the accumulated-gradient implementation is identical to the large-batch implementation, you need to be very careful about how the loss is reduced. In many cases you will need to divide the accumulated loss by gradient_accumulation_steps. See this answer for a detailed implementation.
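For illustration, here is a minimal sketch of the first implementation with that scaling applied, reusing the names from the question and assuming loss_fn uses the default mean reduction and all mini-batches have the same size:

gradient_accumulation_steps = 5

for batch_idx, batch in enumerate(dataset):
    x_batch, y_true_batch = batch
    y_pred_batch = model(x_batch)
    # scale each mini-batch loss so the accumulated gradients match
    # those of one large batch made of gradient_accumulation_steps mini-batches
    loss = loss_fn(y_true_batch, y_pred_batch) / gradient_accumulation_steps
    loss.backward()
    if (batch_idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()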
2. Batch size and learning rate:
Learning rate and batch size are indeed related. When increasing the batch size one usually reduces the learning rate.
See, e.g.:
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le, Don't Decay the Learning Rate, Increase the Batch Size (ICLR 2018).
I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled sample as they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train on larger batches (e.g. 32). According to PyTorch documentation, gradient accumulation is implemented as follows:
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
However, if you are using e.g. 110 training samples, with batch size 8 and accumulation step 4 (i.e. effective batch size 32), this method would only train on the first 96 samples (32 x 3), wasting 14 samples. In order to avoid this, I'd like to modify the code as follows (note the change to the final if statement):
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!
As Lucas Ramos already mentioned, when using DataLoader where the underlying dataset's size is not divisible by the batch size, the default behavior is to have a smaller last batch:
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
Your plan is basically implementing gradient accumulation combined with drop_last=False, that is, having the last batch smaller than all the others.
Therefore, in principle there's nothing wrong with training with varying batch sizes.
However, there is something you need to fix in your code:
The loss is averaged over the mini-batch. So, if you process mini-batches in the usual way, you do not need to worry about it. However, when accumulating gradients you apply this scaling explicitly by dividing the loss by iters_to_accumulate:
loss = loss / iters_to_accumulate
In the last mini-batch (with a smaller size) you need to change the value of iters_to_accumulate to reflect this smaller mini-batch size!
I proposed this revised code, breaking the training loop into two: an outer loop over mini-batches, and an inner one that accumulates gradients per mini-batch. Note how using an iter over the DataLoader helps break the training loop in two:
scaler = GradScaler()

for epoch in epochs:
    bi = 0  # index of the current batch
    # outer loop over mini-batches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range for this accumulated mini-batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the items of the mini-batch - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the actual number of accumulated batches
            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done with the mini-batch loop - gradients were accumulated, we can make an optimization step.
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
I was pretty sure I had seen this done before. Check out this code from PyTorch Lightning (functions _accumulated_batches_reached, _num_training_batches_reached and should_accumulate).
I have two tensors that I am calculating Spearman's rank correlation from, and I would like PyTorch to automatically adjust the values in these tensors in a way that increases my Spearman's rank correlation as much as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
import torch
from torch.autograd import Variable as Var

a = Var(torch.randn(20, 1), requires_grad=True)
psfm_s = Var(torch.randn(12, 20), requires_grad=True)
How can I have a loop that keeps adjusting the values in these two tensors to get the highest Spearman's rank correlation from the two lists I build from them, while having PyTorch do the work? I just need a pointer on where to go. Thank you!
I'm not familiar with Spearman's rank correlation, but if I understand your question, you're asking how to use PyTorch to solve optimization problems other than training deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates we can accomplish this by finding the matrix M and vector b which minimize ‖y − (Mx + b)‖²
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)

    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()

    # compute new gradients
    loss.backward()

    # update parameters M and b
    optimizer.step()

    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that are not necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something, so make sure to negate your objective function before calling backward.
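As a rough sketch of that last point, here is a self-contained example that maximizes Pearson correlation by minimizing its negation. Pearson is used here only as a differentiable stand-in, since the hard rank operation in Spearman's correlation has zero gradient almost everywhere and cannot be optimized directly:

import torch
from torch.nn.parameter import Parameter
from torch import optim

a = Parameter(torch.randn(20))  # tensor we want autograd to adjust
b = torch.randn(20)             # fixed target tensor

optimizer = optim.SGD([a], lr=0.1)
for _ in range(200):
    a_c = a - a.mean()
    b_c = b - b.mean()
    corr = (a_c * b_c).sum() / (a_c.norm() * b_c.norm() + 1e-8)
    loss = -corr  # negate: maximizing corr == minimizing -corr
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('final correlation:', corr.item())  # should approach 1.0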
When I use Keras's binary_crossentropy as the loss function (which calls TensorFlow's sigmoid_cross_entropy), it seems to produce loss values only between [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
    relu_logits - logits * labels,
    math_ops.log1p(math_ops.exp(neg_abs_logits)), name=name)
implies that the range is [0, infinity). So is TensorFlow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add(), I'd assume it would for sure be greater than 1. Am I right to assume that the loss range can definitely exceed 1?
The cross entropy function is indeed not bounded upwards. However it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound to result in the network outputting approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
This can be seen as follows: The cross entropy for a single data point is -sum(p(k)*log(q(k))) where p are the true probabilities (labels), q are the predictions, k are the different classes and the sum is over the classes. Now, with hard labels (i.e. one-hot encoded) only a single p(k) is 1, all others are 0. Thus, the term reduces to -log(q(k)) where k is now the correct class. If with a randomly initialized network q(k) ~ 1/n, we get -log(1/n) = log(n).
We can also start from the definition of the cross entropy, which is generally entropy(p) + Kullback-Leibler divergence(p, q). If p and q are the same distribution (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks) then the KL divergence becomes 0 and we are left with entropy(p).
Now, since the training objective is usually to reduce cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) < 1 and so your cross entropy will generally be quite small.
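As a quick numeric check of the log(n) worst-case intuition, here is a small sketch using PyTorch's cross_entropy purely for illustration (the question is about Keras/TensorFlow, but the math is the same): equal logits give uniform predicted probabilities, and the loss comes out to log(n).

import math
import torch
import torch.nn.functional as F

n = 2
logits = torch.zeros(8, n)          # equal logits -> uniform predicted probabilities
targets = torch.randint(0, n, (8,))

ce = F.cross_entropy(logits, targets)
print(ce.item(), math.log(n))       # both ~0.693 for two classes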