I had a quick question regarding the KL divergence loss as while I'm researching I have seen numerous different implementations. The two most commmon are these two. However, while look at the mathematical equation, I'm not sure if mean should be included.
KL_loss = -0.5 * torch.sum(1 + torch.log(sigma**2) - mean**2 - sigma**2)
OR
KL_loss = -0.5 * torch.sum(1 + torch.log(sigma**2) - mean**2 - sigma**2)
KL_loss = torch.mean(KL_loss)
Thank you!
The equation being used here calculates the loss for a single example:
For batches of data, we need to calculate the loss over multiple examples.
Using our per example equation, we get multiple loss values, 1 per example. We need some way to reduce the per example loss calculations to a single scalar value. Most commonly, you want to take the mean over the batch. You'll see that most of pytorch's loss functions use reduction="mean". The advantage of taking the mean instead of the sum is that our loss becomes batch size invariant (i.e. doesn't scale with batch size).
From the stackoverflow post you linked with the implementations, you'll see the first and second linked implementations take the mean over the batch (i.e. divide by the batch size).
KLD = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
...
(BCE + KLD) / x.size(0)
KL_loss = -0.5 * torch.sum(1 + logv - mean.pow(2) - logv.exp())
...
(NLL_loss + KL_weight * KL_loss) / batch_size
The third linked implementation takes the mean over not just the batch, but also the sigma/mu vectors themselves:
0.5 * torch.mean(mean_sq + stddev_sq - torch.log(stddev_sq) - 1)
So instead of scaling the sum by 1/N where N is the batch size, you're scaling by 1/(NM) where M is the dimensionality of the mu and sigma vectors. In this case, your loss is both batch size and latent dimension size invariant. It's important to note that scaling your loss doesn't change the "shape" of the loss landscape (i.e. optimal points stay fixed), it just scales it (which you can control how to step through via the learning rate).
Related
I am following this variational autoencoder tutorial: https://keras.io/examples/generative/vae/.
I know VAE's loss function consists of the reconstruction loss that compares the original image and reconstruction, as well as the KL loss. However, I'm a bit confused about the reconstruction loss and whether it is over the entire image (sum of squared differences) or per pixel (average sum of squared differences). My understanding is that the reconstruction loss should be per pixel (MSE), but the example code I am following multiplies MSE by 28 x 28, the MNIST image dimensions. Is that correct? Furthermore, my assumption is this would make the reconstruction loss term significantly larger than the KL loss and I'm not sure we want that.
I tried removing the multiplication by (28x28), but this resulted in extremely poor reconstructions. Essentially all the reconstructions looked the same regardless of the input. Can I use a lambda parameter to capture the tradeoff between kl divergence and reconstruction, or it that incorrect because the loss has a precise derivation (as opposed to just adding a regularization penalty).
reconstruction_loss = tf.reduce_mean(
keras.losses.binary_crossentropy(data, reconstruction)
)
reconstruction_loss *= 28 * 28
kl_loss = 1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)
kl_loss = tf.reduce_mean(kl_loss)
kl_loss *= -0.5
total_loss = reconstruction_loss + kl_loss
The example
I'm familiar with that example, and I think the 28x28 multiplier is justified because of the operation tf.reduce_mean(kl_loss) which takes the average loss of all the pixels in the image which would result in a number between 0 and 1 and then multiplies it by the number of pixels. Here's another take with an external training loop for creating a VAE.
The problem is posterior collapse
The above would not be an issue since it's just multiplication by a constant if not for as you point out the KL divergence term. The KL loss acts as a regularizer that penalizes latent variable probability distributions that when sampled using a combination of Gaussians are different than the samples created by the encoder. Naturally, the question arises, how much should be reconstruction loss and how much should be the penalty. This is an area of research. Consider β-VAE which purportedly serves to disentangle representations by increasing the importance of KL-loss, on the other hand, increase β too much and you get a phenomenon known as posterior collapse Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation limits β to 0.1 to avoid the problem. But it may not even be that simple as explained in The Usual Suspects? Reassessing Blame for VAE Posterior Collapse. A thorough solution is proposed in Diagnosing and Enhancing VAE Models. While Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders
suggest that there is a more simple deterministic (and better) way.
Experimentation and Extension
For something simple like Minst, and that example, in particular, try experimenting. Keep the 28x28 term, and arbitrarily multiply kl_loss by a constant B where 0 <= B < 28*28. Follow the kl loss term and the reconstruction loss term during training and compare it to the first reference graphs.
It isn't really necessary to multiply by the number of pixels. However, whether you do so or not will affect the way your fitting algorithm behaves with respect to the other hyper parameters: your lambda parameter and the learning rate. In essence, if you want to remove the multiplication by 28 x 28 but retain the same fitting behavior, you should divide lambda by 28 x 28 and then multiply your learning rate by 28 x 28. I think you were already approaching this idea in your question, and the piece you were missing is the adjustment to the learning rate.
I have two tensors that I am calculating the Spearmans Rank Correlation from, and I would like to be able to have PyTorch automatically adjust the values in these Tensors in a way that increases my Spearmans Rank Correlation number as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I have a loop of constant adjustments of the values in these two tensors to get the highest spearmans rank correlation from 2 lists I make from these 2 tensors while having PyTorch do the work? I just need a guide of where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates we can accomplish this by finding the matrix M and vector b which minimize |(y - (M x+b))²|
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
# compute loss that we want to minimize
y_hat = torch.matmul(M, x) + b
loss = torch.mean((y - y_hat)**2)
# zero the gradients of the parameters referenced by the optimizer (M and b)
optimizer.zero_grad()
# compute new gradients
loss.backward()
# update parameters M and b
optimizer.step()
if (i + 1) % 100 == 0:
# scale learning rate by factor of 0.9 every 100 steps
optimizer.param_groups[0]['lr'] *= 0.9
print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems not necessarily neural network related. I also choose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer since I think that would just confuse things.
In your case you want to maximize something so make sure to negate your objective function before calling backward.
I have a question concerning learning rate decay in Keras. I need to understand how the option decay works inside optimizers in order to translate it to an equivalent PyTorch formulation.
From the source code of SGD I see that the update is done this way after every batch update:
lr = self.lr * (1. / (1. + self.decay * self.iterations))
Does this mean that after every batch update the lr is updated starting from its value from its previous update or from its initial value? I mean, which of the two following interpretation is the correct one?
lr = lr_0 * (1. / (1. + self.decay * self.iterations))
or
lr = lr * (1. / (1. + self.decay * self.iterations)),
where lr is the lr updated after previous iteration and lr_0 is always the initial learning rate.
If the correct answer is the first one, this would mean that, in my case, the learning rate would decay from 0.001 to just 0.0002 after 100 epochs, whereas in the second case it would decay from 0.001 at around 1e-230 after 70 epochs.
Just to give you some context, I'm working with a CNN for a regression problem from images and I just have to translate Keras code into Pytorch code. So far, with the second of the afore-mentioned interpretations I manage to only always predict the same value, disregarding of batch size and input at test time.
Thanks in advance for your help!
Based on the implementation in Keras I think your first formulation is the correct one, the one that contain the initial learning rate (note that self.lr is not being updated).
However I think your calculation is probably not correct: since the denominator is the same, and lr_0 >= lr since you are doing decay, the first formulation has to result in a bigger number.
I'm not sure if this decay is available in PyTorch, but you can easily create something similar with torch.optim.lr_scheduler.LambdaLR.
decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = LambdaLR(optimizer, lr_lambda=fcn)
Finally, don't forget that you will need to call .step() explicitly on the scheduler, it's not enough to step your optimizer. Also, most often learning scheduling is only done after a full epoch, not after every single batch, but I see that here you are just recreating Keras behavior.
Actually, the response of mkisantal might be incorrect, since the actual equation for the learning rate in keras (at least it was, now there is no default decay option) was like this:
lr = lr * (1. / (1. + self.decay * self.iterations))
(see https://github.com/keras-team/keras/blob/2.2.0/keras/optimizers.py#L178)
And the solution presented by mkisantal is missing the recurrent/multiplicative term lr, therefore the more accurate version should be based on MultiplicativeLR:
decay = .001
fcn = lambda step: 1./(1. + decay*step)
scheduler = MultiplicativeLR(optimizer, lr_lambda=fcn)
When I use keras's binary_crossentropy as the loss function (that calls tensorflow's sigmoid_cross_entropy, it seems to produce loss values only between [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
relu_logits - logits * labels,
math_ops.log1p(math_ops.exp(neg_abs_logits)), name=name)
implies that the range is from [0, infinity). So is Tensorflow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add() I'd assume it'd be for sure greater than 1. Am I right to assume that loss range can definitely exceed 1?
The cross entropy function is indeed not bounded upwards. However it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound to result in the network outputing approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
This can be seen as follows: The cross entropy for a single data point is -sum(p(k)*log(q(k))) where p are the true probabilities (labels), q are the predictions, k are the different classes and the sum is over the classes. Now, with hard labels (i.e. one-hot encoded) only a single p(k) is 1, all others are 0. Thus, the term reduces to -log(q(k)) where k is now the correct class. If with a randomly initialized network q(k) ~ 1/n, we get -log(1/n) = log(n).
We can also go of the definition of the cross entropy which is generally entropy(p) + kullback-leibler divergence(p,q). If p and q are the same distributions (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks) then the KL divergence becomes 0 and we are left with entropy(p).
Now, since the training objective is usually to reduce cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) < 1 and so your cross entropy will generally be quite small.
I'm confused regarding as to how the adam optimizer actually works in tensorflow.
The way I read the docs, it says that the learning rate is changed every gradient descent iteration.
But when I call the function I give it a learning rate. And I don't call the function to let's say, do one epoch (implicitly calling # iterations so as to go through my data training). I call the function for each batch explicitly like
for epoch in epochs
for batch in data
sess.run(train_adam_step, feed_dict={eta:1e-3})
So my eta cannot be changing. And I'm not passing a time variable in. Or is this some sort of generator type thing where upon session creation t is incremented each time I call the optimizer?
Assuming it is some generator type thing and the learning rate is being invisibly reduced: How could I get to run the adam optimizer without decaying the learning rate? It seems to me like RMSProp is basically the same, the only thing I'd have to do to make it equal (learning rate disregarded) is to change the hyperparameters momentum and decay to match beta1 and beta2 respectively. Is that correct?
I find the documentation quite clear, I will paste here the algorithm in pseudo-code:
Your parameters:
learning_rate: between 1e-4 and 1e-2 is standard
beta1: 0.9 by default
beta2: 0.999 by default
epsilon: 1e-08 by default
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Initialization:
m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)
m_t and v_t will keep track of a moving average of the gradient and its square, for each parameters of the network. (So if you have 1M parameters, Adam will keep in memory 2M more parameters)
At each iteration t, and for each parameter of the model:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * gradient
v_t <- beta2 * v_{t-1} + (1 - beta2) * gradient ** 2
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
Here lr_t is a bit different from learning_rate because for early iterations, the moving averages have not converged yet so we have to normalize by multiplying by sqrt(1 - beta2^t) / (1 - beta1^t). When t is high (t > 1./(1.-beta2)), lr_t is almost equal to learning_rate
To answer your question, you just need to pass a fixed learning rate, keep beta1 and beta2 default values, maybe modify epsilon, and Adam will do the magic :)
Link with RMSProp
Adam with beta1=1 is equivalent to RMSProp with momentum=0. The argument beta2 of Adam and the argument decay of RMSProp are the same.
However, RMSProp does not keep a moving average of the gradient. But it can maintain a momentum, like MomentumOptimizer.
A detailed description of rmsprop.
maintain a moving (discounted) average of the square of gradients
divide gradient by the root of this average
(can maintain a momentum)
Here is the pseudo-code:
v_t <- decay * v_{t-1} + (1-decay) * gradient ** 2
mom = momentum * mom{t-1} + learning_rate * gradient / sqrt(v_t + epsilon)
variable <- variable - mom
RMS_PROP and ADAM both have adaptive learning rates .
The basic RMS_PROP
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
You can see originally this has two parameters decay_rate & eps
Then we can add a momentum to make our gradient more stable Then we can write
cache = decay_rate * cache + (1 - decay_rate) * dx**2
**m = beta1*m + (1-beta1)*dx** [beta1 =momentum parameter in the doc ]
x += - learning_rate * dx / (np.sqrt(cache) + eps)
Now you can see here if we keep beta1 = o Then it's rms_prop without the momentum .
Then Basics of ADAM
In cs-231 Andrej Karpathy has initially described the adam like this
Adam is a recently proposed update that looks a bit like RMSProp with
momentum
So yes ! Then what makes this difference from the rms_prop with momentum ?
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
**x += - learning_rate * m / (np.sqrt(v) + eps)**
He again mentioned in the updating equation m , v are more smooth .
So the difference from the rms_prop is the update is less noisy .
What makes this noise ?
Well in the initialization procedure we will initialize m and v as zero .
m=v=0
In order to reduce this initializing effect it's always to have some warm-up . So then equation is like
m = beta1*m + (1-beta1)*dx beta1 -o.9 beta2-0.999
**mt = m / (1-beta1**t)**
v = beta2*v + (1-beta2)*(dx**2)
**vt = v / (1-beta2**t)**
x += - learning_rate * mt / (np.sqrt(vt) + eps)
Now we run this for few iterations . Clearly pay attention to the bold lines , you can see when t is increasing (iteration number) following thing happen to the mt ,
mt = m