I have an output tensor (both target and predicted) of dimension (32 x 8 x 5000). Here, the batch size is 32, the number of classes is 5000 and the number of points per batch is 8. I want to calculate CELoss on this in such a way that, the loss is computed for every point (across 5000 classes) and then averaged across the 8 points. How can I do this?
For clarity, there are 32 batch points in a batch (for bs=32). Each batch point has 8 vector points, and each vector point has 5000 classes. For a given batch, I wish to compute CELoss across all (8) vector points, compute their average and do so for all the batch points (32).
Let me know if my question is unclear or ambiguous.
For example:
op = torch.rand((4,3,5))
gt = torch.tensor([
[[0,1,1,0,0],[0,0,1,0,0],[1,1,0,0,1]],
[[1,1,0,0,1],[0,0,0,1,0],[0,0,1,0,0]],
[[0,0,1,0,0],[1,1,1,1,0],[1,1,0,0,1]],
[[1,1,0,0,1],[1,1,0,0,1],[1,0,0,0,0]]
])
DATA
op = torch.rand((4,3,5))
gt = torch.tensor([
[[0,1,1,0,0],[0,0,1,0,0],[1,1,0,0,1]],
[[1,1,0,0,1],[0,0,0,1,0],[0,0,1,0,0]],
[[0,0,1,0,0],[1,1,1,1,0],[1,1,0,0,1]],
[[1,1,0,0,1],[1,1,0,0,1],[1,0,0,0,0]]
], dtype=torch.float)
Now, if your output is in [0, 1] (if it is not, add a Sigmoid activation at the end of your model), you can compute the binary cross-entropy losses (N_class values for each point of each element) in this way:
torch.nn.BCELoss(reduction="none")(op, gt)
You can finally compute the average loss for each element of batch as:
torch.nn.BCELoss(reduction="none")(op, gt).mean(dim=[-1,-2])
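Putting the data and the loss together for the original (32 x 8 x 5000) shape, a minimal sketch could look like this (assuming multi-hot float targets in {0, 1} and sigmoid outputs; the random tensors are only placeholders):
import torch

op = torch.rand((32, 8, 5000))                        # model outputs after a Sigmoid, values in [0, 1]
gt = torch.randint(0, 2, (32, 8, 5000)).float()       # multi-hot targets in {0, 1}

per_class = torch.nn.BCELoss(reduction="none")(op, gt)    # shape (32, 8, 5000)
per_element = per_class.mean(dim=[-1, -2])                # average over the 5000 classes and 8 points -> shape (32,)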
If it is not the solution you are looking for or it is not clear let me know.
I am trying to write a model that outputs a vector of length N consisting of the labels -1, 0 and 1. Each label represents one of three decisions for the system participants (wireless devices), so the vector describes a system state that is then passed on to an optimization problem in the next step. Because the fixed problem formulation expects this vector, using 0, 1 and 2 instead is not possible.
After coming across this tanh-based function for producing the -1, 0 and 1 values:
1.5 * backend.tanh(alpha * x) + 0.5 * (backend.tanh(-(3 / alpha) * x)) from here, I was wondering how exactly this output layer and the penultimate layer can be built to supply this vector of labels {-1, 0, 1}. I tried using the above function in the output layer of a simple Iris classifier, but this resulted in terrible accuracy compared to the one achieved with 0, 1, 2 labels and a softmax output layer.
Thanks in advance,
with kind regards,
Yuka
It doesn't seem like the outputs are actually "numerically related", for lack of a better term; the labels could just as well be "left", "right", "up". So I think your best bet is to have 3 output nodes in the final layer with a softmax activation function, each of the three nodes representing one of the three labels, and to use a cross-entropy loss function.
If your training data currently has the target as -1/0/1, you should one-hot encode it so that each target is a vector of length 3; for example, label 0 might become [0, 1, 0], as in the sketch below.
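A minimal sketch of that encoding step (the label array is illustrative, and to_categorical is just one convenient helper, assuming a Keras setup like the Iris experiment above):
import numpy as np
import tensorflow as tf

labels = np.array([-1, 0, 1, 1, -1])                  # original targets in {-1, 0, 1}
class_indices = labels + 1                            # shift to class indices {0, 1, 2}
one_hot = tf.keras.utils.to_categorical(class_indices, num_classes=3)
print(one_hot[1])                                     # label 0 becomes [0., 1., 0.]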
I want to compute the sum of cross-entropy over all classes for each prediction, where the input is a batch (size n) and the output is a batch (size n).
The simplest way is a for loop (here for 1000 classes):
import torch

def sum_of_CE_lost(input):
    loss_fn = torch.nn.CrossEntropyLoss()
    L = 0
    for c in range(1000):
        target = torch.full((input.shape[0],), c, dtype=torch.long)
        L = L + loss_fn(input, target)
    return L
However, it is very slow. What is a better way? How can we parallelized it for GPU (CUDA)?
First of all, to make it faster, you need to vectorize it, that is, work with matrices.
So, imagine you have 1,000 samples to compute the loss for, and your classification problem has 5 labels. To compute the CrossEntropyLoss we need an input and a target. Let's simulate that as follows:
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()  # the loss function
input = torch.randn(1000, 5)  # 1000 samples with 5 labels' predictions (logits)
target = torch.empty(1000, dtype=torch.long).random_(5)  # 1000 samples with labels from 0 to 4
loss_value = loss(input, target)  # outputs the loss
There we go! Now the loss is computed over all 1,000 samples at once. This is the fastest way to do it.
I found the answer:
-torch.nn.functional.log_softmax(input, dim=1).sum() / input.shape[0]
Note the leading minus sign (cross-entropy is the negative log-probability). We divide by input.shape[0] because cross_entropy() takes the mean across the batch dimension by default, so summing the per-class losses and dividing by the batch size matches the loop above.
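As a rough sanity check of that equivalence, a small sketch with illustrative sizes (double precision so the loop and the vectorized form agree closely):
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input = torch.randn(4, 1000, dtype=torch.float64)    # 4 samples, 1000 classes

# slow loop: sum CrossEntropyLoss over every possible class label
loop_total = sum(F.cross_entropy(input, torch.full((4,), c, dtype=torch.long))
                 for c in range(1000))

# vectorized equivalent: each cross_entropy call averages over the batch
vectorized = -F.log_softmax(input, dim=1).sum() / input.shape[0]
print(torch.allclose(loop_total, vectorized))        # expected: True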
What will happen when I use batch normalization but set batch_size = 1?
Because I am using 3D medical images as the training dataset, the batch size can only be set to 1 because of GPU limitations. Normally, I know, when batch_size = 1 the variance will be 0, and (x - mean)/variance will lead to an error because of division by zero.
But why did no errors occur when I set batch_size = 1? Why did my network train as well as I expected? Could anyone explain it?
Some people argued that:
The ZeroDivisionError may not be encountered for two reasons. First, the exception is caught in a try/except block. Second, a small rational number is added (1e-19) to the variance term so that it is never zero.
But some people disagree. They said that:
You should calculate the mean and std across all pixels in the images of the batch. (So even with batch_size = 1, there are still a lot of pixels in the batch, and the reason batch_size = 1 can still work is not the 1e-19 term.)
I have checked the PyTorch source code, and from the code I think the latter is right.
Does anyone have a different opinion?
variance will be 0
No, it won't; BatchNormalization computes statistics only with respect to a single axis (usually the channels axis, =-1 (last) by default); every other axis is collapsed, i.e. summed over for averaging; details below.
More importantly, however, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance degrades for batch_size under 32, and severely for batch_size <= 8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended.
Small mini-batch alternatives: Batch Renormalization -- Layer Normalization -- Weight Normalization
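For example, a minimal Keras sketch of swapping in one of these alternatives (LayerNormalization here; the layer choice and input shape are only illustrative):
import tensorflow as tf

# LayerNormalization normalizes each sample on its own, so it does not depend on the batch size
layer = tf.keras.layers.LayerNormalization(axis=-1)
x = tf.random.normal((1, 16, 16, 8))    # batch_size = 1 is fine here
y = layer(x)
print(y.shape)                          # (1, 16, 16, 8)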
Implementation details: from source code:
reduction_axes = list(range(len(input_shape)))
del reduction_axes[self.axis]
Eventually, tf.nn.moments is called with axes=reduction_axes, which performs a reduce_sum to compute the variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.
In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN will run calculations over the 1, height, width, and depth dimensions.
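To make that concrete, a small NumPy sketch with an illustrative single 3D volume:
import numpy as np

x = np.random.randn(1, 8, 8, 8, 4)      # (batch=1, height, width, depth, channels)
mean = x.mean(axis=(0, 1, 2, 3))        # one mean per channel, shape (4,)
var = x.var(axis=(0, 1, 2, 3))          # one variance per channel, generally far from zero
print(mean.shape, var.min())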
Can variance ever be zero? - yes, if every single datapoint for any given channel slice (along every dimension) is the same. But this should be near-impossible for real data.
Other answers: first one is misleading:
a small rational number is added (1e-19) to the variance
This doesn't happen when computing the variance, but it is added to the variance when normalizing; nonetheless, it is rarely necessary, as the variance is far from zero. Also, the epsilon term actually defaults to 1e-3 in Keras; it serves a regularizing role beyond merely avoiding division by zero.
Update: I failed to address an important piece of the intuition behind suspecting the variance to be 0; indeed, the variance of the batch statistics is zero, since there is only one statistic - but the "statistic" itself concerns the mean & variance of the channel + spatial dimensions. In other words, the variance of the mean & variance (of the single train sample) is zero, but the mean & variance themselves aren't.
when batch_size = 1, variance will be 0
No, because when you compute the mean and variance for BN (for example using tf.nn.moments) you compute them over axes [0, 1, 2] (assuming you have NHWC tensor channel order).
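A minimal sketch of that computation (NHWC layout and shapes assumed purely for illustration):
import tensorflow as tf

x = tf.random.normal((1, 16, 16, 8))              # batch_size = 1, 8 channels
mean, variance = tf.nn.moments(x, axes=[0, 1, 2])
print(mean.shape, variance.shape)                 # both (8,): one statistic per channel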
From "Group Normalization" paper:
https://arxiv.org/pdf/1803.08494.pdf
With batch_size=1 batch normalization is equal to instance normalization and it can be helpful in some tasks.
But if you are using some sort of encoder-decoder and in some layer you have a tensor with a spatial size of 1x1, it will be a problem, because each channel then has only one value, the mean equals that value, and so BN will zero out the information.
I am trying to train an autoencoder on some simulated data where an input is basically a vector with Gaussian noise applied. The code is almost exactly the same as in this example: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/autoencoder.py
The only differences are that I changed the network parameters and the cost function:
n_hidden_1 = 32 # 1st layer num features
n_hidden_2 = 16 # 2nd layer num features
n_input = 149 # LunaH-Map data input (number of counts per orbit)
cost = tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_pred), reduction_indices=[1]))
During training, the error steadily decreases down to 0.00015, but the predicted and true values are very different, e.g. as shown in this image. In fact, the predicted y vector is almost all ones.
How is it possible to get decreasing error with very wrong predictions? Is it possible that my network is just trying to move the weights closer to log(1) so as to minimize the cross entropy cost? If so, how do I combat this?
Yes, the network simply learns to predict 1, which reduces the loss. The cross-entropy loss you are using is the categorical one, which is meant for the case where y_true is a one-hot code (example: [0,0,1,0]) and the final layer is a softmax (ensuring the outputs sum to 1). So when y_true[idx] is 0, the loss doesn't care about y_pred[idx]; when y_true[idx] is 1 and y_pred[idx] is 0 there is an infinite (high) loss, but if y_pred[idx] is 1 the loss is again 0.
Now, categorical cross-entropy loss is not suitable for autoencoders. For real-valued inputs, and hence outputs, the usual choice is mean-squared error, which is what is used in the example you cited. But there the final activation layer is a sigmoid, implicitly saying that each element of x lies between 0 and 1. So either you need to convert your data to match that assumption or make the last layer of the decoder linear.
If you do want to use a cross-entropy loss, you can use binary cross-entropy.
For inputs in {0, 1}, binary cross-entropy is: -tf.reduce_mean(y_true * tf.log(y_pred) + (1 - y_true) * tf.log(1 - y_pred)). If you work it out, in both misprediction cases (0 predicted as 1, 1 predicted as 0) the network gets an infinite loss. Note again that here the final layer should be a sigmoid and the elements of x should be between 0 and 1.
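A small sketch contrasting the two loss choices discussed above (TF 2.x syntax, illustrative values; note the leading minus sign on the binary cross-entropy):
import tensorflow as tf

y_true = tf.constant([[0., 1., 1., 0.]])
y_pred = tf.constant([[0.1, 0.9, 0.8, 0.2]])      # sigmoid outputs in (0, 1)

mse = tf.reduce_mean(tf.square(y_true - y_pred))                     # for real-valued targets
bce = -tf.reduce_mean(y_true * tf.math.log(y_pred)
                      + (1. - y_true) * tf.math.log(1. - y_pred))    # for binary targets
print(float(mse), float(bce))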
I am trying to follow the udacity tutorial on tensorflow where I came across the following two lines for word embedding models:
# Look up embeddings for inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases,
                               embed, train_labels, num_sampled, vocabulary_size))
Now I understand that the second statement is for sampling negative labels. But the question is: how does it know what the negative labels are? All I am providing to the second function is the current input and its corresponding labels, along with the number of labels that I want to sample negatively. Isn't there a risk of sampling from the input set itself?
This is the full example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb
You can find the documentation for tf.nn.sampled_softmax_loss() here. There is even a good explanation of Candidate Sampling provided by TensorFlow here (pdf).
How does it know what the negative labels are?
TensorFlow will randomly select negative classes among all the possible classes (for you, all the possible words).
Isn't there the risk of sampling from the input set in itself?
When you want to compute the softmax probability for your true label, you compute roughly exp(logits[true_label]) / sum(exp(logits[sampled_labels])). As the number of classes is huge (the vocabulary size), there is very little probability of sampling the true_label as a negative label.
Anyway, I think TensorFlow removes this possibility altogether when randomly sampling. (EDIT: #Alex confirms TensorFlow does this by default)
Candidate sampling explains how the sampled loss function is calculated:
Compute the loss function in a subset C of all training samples L, where C = T ⋃ S, T is the samples in target classes, and S is the randomly chosen samples in all classes.
The code you provided uses tf.nn.embedding_lookup to get the inputs embed, of shape [batch_size, dim].
Then it uses tf.nn.sampled_softmax_loss to get the sampled loss function:
softmax_weights: A Tensor of shape [num_classes, dim].
softmax_biases: A Tensor of shape [num_classes]. The class biases.
embed: A Tensor of shape [batch_size, dim].
train_labels: A Tensor of shape [batch_size, 1]. The target classes T.
num_sampled: An int. The number of classes to randomly sample per batch; the number of classes in S.
vocabulary_size: The number of possible classes.
sampled_values: defaults to log_uniform_candidate_sampler.
For one batch, the target samples are just train_labels (T). It chooses num_sampled classes at random (S) to serve as negative samples.
It samples the negative classes (by default with the log_uniform_candidate_sampler noted above) and uses the corresponding rows of softmax_weights and softmax_biases. Since embed is embeddings[train_dataset] (of shape [batch_size, embedding_size]), a sampled class may coincide with train_labels[i], in which case it is not really a negative label.
According to Candidate Sampling, page 2, there are different types: for NCE and negative sampling, NEG = S, which may contain a part of T; for sampled logistic and sampled softmax, NEG = S - T, i.e. T is explicitly removed.
So, depending on the sampling type, there is indeed a chance of sampling from the target set itself.
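To illustrate the behaviour both answers describe, here is a minimal sketch of the call in current TF 2.x argument order (shapes and sizes are illustrative); remove_accidental_hits=True, the default for sampled softmax, is what drops any sampled class that equals a true label (the S - T case above):
import tensorflow as tf

vocabulary_size, embedding_size, batch_size, num_sampled = 10000, 128, 8, 64
softmax_weights = tf.Variable(tf.random.normal([vocabulary_size, embedding_size]))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
embed = tf.random.normal([batch_size, embedding_size])          # stand-in for the embedding_lookup output
train_labels = tf.random.uniform([batch_size, 1], maxval=vocabulary_size, dtype=tf.int64)

loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights,
    biases=softmax_biases,
    labels=train_labels,
    inputs=embed,
    num_sampled=num_sampled,
    num_classes=vocabulary_size,
    remove_accidental_hits=True))   # default: sampled classes that hit a true label are removed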