In TF2/Keras, I have trained an autoencoder using tensorflow.keras.losses.MeanSquaredError as the loss function. Now I want to train this model further with another loss function, specifically tensorflow.keras.losses.KLDivergence. The reason is that unsupervised training is conducted first for representation learning. Then, having the generated embeddings, I can cluster them and use these clusters for self-supervision, i.e. as labels, enabling the second, supervised loss and improving the model further.
This is not transfer learning per se, as no new layers are added to the model; only the loss function is changed and the model continues training.
What I have tried is using the pretrained (MSE-trained) model as a property of the new model:
class ClusterBooster(tf.keras.Model):
    def __init__(self, base_model, centers):
        super(ClusterBooster, self).__init__()
        self.pretrained = base_model
        self.centers = centers

    def train_step(self, data):
        with tf.GradientTape() as tape:
            loss = self.compiled_loss(self.P, self.Q, regularization_losses=self.losses)

        # Compute gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))

        return {m.name: m.result() for m in self.metrics}
where the loss is the KL loss between distributions P and Q. The distributions are computed in a callback function instead of the model train_step as I need access to the current epoch (P is updated every 5 epochs, not on each epoch):
def on_epoch_begin(self, epoch, logs=None):
    z = self.model.pretrained.embed(self.feature, training=True)
    z = tf.reshape(z, [tf.shape(z)[0], 1, tf.shape(z)[1]])  # reshape for broadcasting

    # CALCULATE Q FOR EVERY EPOCH
    partial = tf.math.pow(tf.norm(z - self.model.centers, axis=2, ord='euclidean'), 2)
    nominator = 1 / (1 + partial)
    denominator = tf.math.reduce_sum(1 / (1 + partial))
    self.model.Q = nominator / denominator

    # CALCULATE P EVERY 5 EPOCHS TO AVOID INSTABILITY
    if epoch % 5 == 0:
        partial = tf.math.pow(self.model.Q, 2) / tf.math.reduce_sum(self.model.Q, axis=1, keepdims=True)
        nominator = partial
        denominator = tf.math.reduce_sum(partial, axis=0)
        self.model.P = nominator / denominator
However, when apply_gradients() is executed I get:
ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0']
I think that this is due to the fact that the pretrained model is not set to be further trained somewhere inside the new model (only the embed() method is called, which does not train the model). Is this a correct approach and am I just missing something, or is there a better way?
It seems that whatever computation takes place in a callback isn't tracked for gradient computation and weight updates. Thus, these computations should be placed inside the train_step() function of the custom Model class (ClusterBooster).
Given that I don't have access to the epoch number inside train_step() of ClusterBooster, I created a custom training loop without a Model class, where I could use plain Python code (which is executed eagerly). A sketch of that loop is below.
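Roughly, a sketch of such a loop (base_model, features, centers and epochs stand in for the actual objects; the Q and P formulas are the ones from the question):
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
kld = tf.keras.losses.KLDivergence()

def soft_assignments(z, centers):
    # Q: soft assignment of embeddings to cluster centers
    z = tf.reshape(z, [tf.shape(z)[0], 1, tf.shape(z)[1]])
    dist_sq = tf.math.pow(tf.norm(z - centers, axis=2, ord='euclidean'), 2)
    q = 1.0 / (1.0 + dist_sq)
    return q / tf.math.reduce_sum(q)

def target_distribution(q):
    # P: sharpened version of Q
    weight = tf.math.pow(q, 2) / tf.math.reduce_sum(q, axis=1, keepdims=True)
    return weight / tf.math.reduce_sum(weight, axis=0)

for epoch in range(epochs):
    # update P only every 5 epochs, and keep it fixed (no gradient through P)
    if epoch % 5 == 0:
        q_frozen = soft_assignments(base_model.embed(features, training=False), centers)
        p = tf.stop_gradient(target_distribution(q_frozen))

    with tf.GradientTape() as tape:
        # Q is recomputed inside the tape, so its gradient w.r.t. the weights is tracked
        q = soft_assignments(base_model.embed(features, training=True), centers)
        loss = kld(p, q)

    gradients = tape.gradient(loss, base_model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, base_model.trainable_variables))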
I am replicating a paper. I have a basic Keras CNN model for MNIST classification. For each sample z in the training set, I want to calculate the Hessian matrix of that sample's loss with respect to the model parameters, and then average this Hessian over the training data (n is the number of training samples).
My final goal is to calculate the influence score, which combines a left term, a Hessian term, and a right term.
I can calculate the left term and the right term, but I don't know how to calculate the Hessian with respect to the model weights for a batch of examples (vectorized). I was able to calculate it only one sample at a time, which is too slow.
x = tf.convert_to_tensor(x_train[0:13])
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = model(x)
        mce = tf.keras.losses.CategoricalCrossentropy()
        y_expanded = y_train[0:13]
        loss = mce(y_expanded, y)
    g = t1.gradient(loss, model.weights[4])
h = t2.jacobian(g, model.weights[4])
print(h.shape)
For clarification, if a model layer has weights of dimension 20*30 and I feed it a batch of 13 samples, I want a Hessian of dimension (13, 20, 30, 20, 30). With the code above I can only get a Hessian of dimension (20, 30, 20, 30), which thwarts the vectorization.
This thread has the same problem, except that I want the second-order derivative rather than the first-order.
I also tried the script below, which returns a (13, 20, 30, 20, 30) matrix of the right dimensions, but when I manually compared the sum of this matrix with the sum of 13 single-sample Hessian calculations (a for loop from 0 to 12), the numbers were different, so it does not work either; I expected them to be equal.
x = tf.convert_to_tensor(x_train[0:13])
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        t1.watch(model.weights[4])
        y_expanded = y_train[0:13]
        y = model(x)
        loss = mce(y_expanded, y)
    j1 = t1.jacobian(loss, model.weights[4])
j3 = t2.jacobian(j1, model.weights[4])
print(j3.shape)
That's how Hessians are defined: you can only calculate the Hessian of a scalar function.
But nothing is new here; the same happens with gradients, and what is done to handle batches is to accumulate the gradients. Something similar can be done with the Hessian.
If you know how to compute the Hessian of the loss for one sample, you can define a batch cost and still compute its Hessian with the same method, e.g. define your cost as sum(losses), where losses is the vector of losses for all examples in the batch; a minimal sketch follows.
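For example, a minimal sketch of that batch-cost idea, reusing model, x_train, y_train and model.weights[4] from the question:
import tensorflow as tf

x = tf.convert_to_tensor(x_train[0:13])
y_true = tf.convert_to_tensor(y_train[0:13])
# SUM reduction turns the batch into a single scalar cost = sum of the per-sample losses
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.SUM)

w = model.weights[4]  # e.g. shape (20, 30)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y_pred = model(x)
        batch_cost = mce(y_true, y_pred)  # scalar
    g = t1.gradient(batch_cost, w)        # shape (20, 30)
h = t2.jacobian(g, w)                     # shape (20, 30, 20, 30): Hessian of the summed loss
print(h.shape)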
Suppose you have a model and you want to train its weights using the Hessian of the loss on the training images w.r.t. the trainable weights.
# Import the libraries we need
import tensorflow as tf
from tensorflow.python.eager import forwardprop

model = tf.keras.models.load_model('model.h5')

# Define the Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# Define the loss function
def loss_function(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

# Define the accuracy metric function
def accuracy_function(y_true, y_pred):
    return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
Now, define the variables for storing the mean of the loss and accuracy
train_loss = tf.keras.metrics.Mean(name='loss')
train_accuracy = tf.keras.metrics.Mean(name='accuracy')

# Now compute the Hessian in a different style for better efficiency of the model
vector = [tf.ones_like(v) for v in model.trainable_variables]

def _forward_over_back_hvp(images, labels):
    with forwardprop.ForwardAccumulator(model.trainable_variables, vector) as acc:
        with tf.GradientTape() as grad_tape:
            logits = model(images, training=True)
            loss = loss_function(labels, logits)
        grads = grad_tape.gradient(loss, model.trainable_variables)
    hessian = acc.jvp(grads)
    optimizer.apply_gradients(zip(hessian, model.trainable_variables))
    train_loss(loss)  # keep adding the loss
    train_accuracy(accuracy_function(labels, logits))  # keep adding the accuracy
# Now, here we need to call the function and train it
import time

for epoch in range(20):
    start = time.time()
    train_loss.reset_states()
    train_accuracy.reset_states()
    for i, (x, y) in enumerate(dataset):
        _forward_over_back_hvp(x, y)
        if i % 50 == 0:
            print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
Epoch 1 Loss 2.6396 Accuracy 0.1250
Time taken for 1 epoch: 0.23 secs
I'm trying to implement a linear classifier in PyTorch, using one layer with tensors W and b, softmax, and cross-entropy loss. For each batch I have to:
Calculate logits
Transform logits to probabilities with softmax
Compute most probable classes
Calculate cross entropy between true and predicted classes
Use an optimizer to change W and b
So far I have the following (with flat MNIST loaded via scikit-learn):
import torch
import torch.nn.functional as torch_f  # functional API for softmax / log_softmax

# convert NumPy arrays to PyTorch tensors
input_X_train = torch.from_numpy(X_train_flat).float().to(device)
input_X_val = torch.from_numpy(X_val_flat).float().to(device)
input_X_test = torch.from_numpy(X_test_flat).float().to(device)
input_y_train = torch.from_numpy(y_train).long().to(device)
input_y_val = torch.from_numpy(y_val).long().to(device)
input_y_test = torch.from_numpy(y_test).long().to(device)

# model parameters: W and b
W = torch.randn(input_dim, output_dim, device=device, dtype=dtype, requires_grad=True)
b = torch.randn(1, device=device, dtype=dtype, requires_grad=True)

BATCH_SIZE = 512
EPOCHS = 40
LEARNING_RATE = 1e-6

# create torch.optim.Adam optimizer for loss function minimization
optimizer = torch.optim.Adam([W, b], lr=LEARNING_RATE)

# create negative log likelihood loss function object for loss evaluation
# use mean loss value from all batch samples
loss_fn = torch.nn.NLLLoss(reduction="mean")

for t in range(EPOCHS):
    # logits for input_X, resulting shape should be [input_X.shape[0], 10]
    logits = torch.matmul(input_X_train, W) + b

    # apply torch.nn.functional.softmax (torch_f.softmax) to the logits
    probas = torch_f.softmax(logits, dim=1)

    # apply torch.argmax to find the class index with the highest probability
    classes = torch.argmax(probas, dim=1)

    # loss should be a scalar: the average loss over all the objects
    # PyTorch's NLLLoss expects log-probabilities, so first compute log_softmax of the
    # logits, then the negative log likelihood loss, which swaps the sign.
    # torch_f.log_softmax on the logits is identical to computing cross-entropy
    # (log and then NLL) on top of probas, but is more numerically stable (see the docs).
    log_probas = torch_f.log_softmax(logits, dim=1)
    loss = loss_fn(log_probas, input_y_train)

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (the learnable weights of the
    # model). This is because, by default, gradients are accumulated in buffers
    # (i.e. not overwritten) whenever .backward() is called. See the docs of
    # torch.autograd.backward for more details.
    optimizer.zero_grad()

    # calculate backward gradients for backpropagation
    loss.backward()

    # calling step() on an optimizer makes an update to its parameters
    optimizer.step()
For some reason, the W and b don't change. What am I doing wrong?
EDIT:
I've seen, and tried in the code above, e.g. this minimal working example: https://discuss.pytorch.org/t/minimal-working-example-of-optim-sgd/11623/2.
EDIT 2:
Gradients W.grad often look suspicious; I think it should not be like that. The class probabilities are definitely right (so it's not, e.g., like this example), since I've checked that the probabilities of all classes for each sample sum to 1.
I'm trying to implement a variant of a Variational AutoEncoder with KL warmup in TensorFlow (paper here). The idea is that the KL term of the loss should be increased linearly over a specified number of epochs at the beginning of training.
The way I tried was using a callback that sets a value in a K.variable every time a new epoch begins, equal to the current epoch number over the desired warm-up span (for example, if the warm-up is set to last for 10 epochs, at epoch 6 the KL term in the loss should be multiplied by 0.6).
I'm also including an add_metric() in the KL layer (which is implemented as a Layer subclass) to monitor the kl_rate during training. The problem is that the value of the variable is unstable! It starts close to the desired value at each new epoch, but it slowly decays on every iteration, making the process not very controllable.
Do you have any idea what I'm doing wrong? I'm also not sure if it's a problem of the callback itself (and subsequently of the actual used value) or of the reported metric.
Thanks!
The imports:
import tensorflow.keras.backend as K
The callback (self.kl_warmup is a parameter of the model class that is set to an integer, corresponding to the number of epochs during which the kl rate should be increased):
kl_beta = K.variable(1.0, name="kl_beta")
if self.kl_warmup:
    kl_warmup_callback = LambdaCallback(
        on_epoch_begin=lambda epoch, logs: K.set_value(
            kl_beta, K.min([epoch / self.kl_warmup, 1])
        )
    )

z_mean, z_log_sigma = KLDivergenceLayer(beta=kl_beta)([z_mean, z_log_sigma])
The KL layer:
class KLDivergenceLayer(Layer):
    """Identity transform layer that adds KL divergence
    to the final model loss.
    """

    def __init__(self, beta=1.0, *args, **kwargs):
        self.is_placeholder = True
        self.beta = beta
        super(KLDivergenceLayer, self).__init__(*args, **kwargs)

    def get_config(self):
        config = super().get_config().copy()
        config.update({"beta": self.beta})
        return config

    def call(self, inputs, **kwargs):
        mu, log_var = inputs
        kl_batch = -0.5 * K.sum(1 + log_var - K.square(mu) - K.exp(log_var), axis=-1)
        self.add_loss(self.beta * K.mean(kl_batch), inputs=inputs)
        self.add_metric(self.beta, aggregation="mean", name="kl_rate")
        return inputs
The model instance (the entire model is built inside a class that returns encoder, generator, full vae and the kl_rate callback):
encoder, generator, vae, kl_warmup_callback = SEQ_2_SEQ_VAE(pttest.shape,
                                                            loss='ELBO',
                                                            kl_warmup_epochs=10).build()
The fit() call:
history = vae.fit(x=pttrain, y=pttrain, epochs=100, batch_size=512, verbose=1,
                  validation_data=(pttest, pttest),
                  callbacks=[tensorboard_callback, kl_warmup_callback])
A snippet of the training process (note the kl_rate, which should be zero but is off).
A screenshot of the kl_rate over epochs from TensorBoard (the span was set to 10 epochs; after 10 epochs it should reach 1, but it converges to about 0.9).
Using a callback will work, but it's a little clunky. If you're happy working in terms of training steps (iterations) rather than epochs, then that value is already being tracked/updated in the optimizer.
Creating a new layer is also a bit misleading if it doesn't have any variables or perform any operation. A custom activity_regularizer on the layer producing the distribution parameters is a better fit here, and will prevent you from accidentally using the un-regularized parameters.
Without seeing your SEQ_2_SEQ_VAE it's difficult to give an exact code example, but hopefully the below gives enough of an idea of how to implement it.
class KLDivergenceRegularizer(tf.keras.regularizers.Regularizer):
    def __init__(self, iters: tf.Variable, warm_up_iters: int):
        self._iters = iters
        self._warm_up_iters = warm_up_iters

    def __call__(self, activation):
        # note: activity regularizers automatically divide by batch size
        mu, log_var = activation
        # ramp the KL weight linearly from 0 to 1 over warm_up_iters steps
        k = tf.minimum(tf.cast(self._iters, tf.float32) / self._warm_up_iters, 1.0)
        return -0.5 * k * K.sum(1 + log_var - K.square(mu) - K.exp(log_var))


optimizer = tf.keras.optimizers.Adam()  # or whatever optimizer you want
warm_up_iters = 1000  # not epochs

inp = make_input()
x = tf.keras.layers.Dense(...)(inp)
...
mu, log_sigma = ParametersLayer(
    ...,
    activity_regularizer=KLDivergenceRegularizer(optimizer.iterations, warm_up_iters))(x)
...
vae = build_vae(...)
vae.compile(optimizer=optimizer, ...)
If you have multiple things like this, you could use tf.summary.experimental.set_step(optimizer.iterations) and tf.summary.experimental.get_step() inside regularizers.
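For example, a minimal sketch of sharing the step counter this way:
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
# make the optimizer's iteration counter the default summary step
tf.summary.experimental.set_step(optimizer.iterations)

# later, e.g. inside another regularizer or metric:
step = tf.summary.experimental.get_step()  # the same variable as optimizer.iterations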
I ended up discovering it myself after a bit more research.
kl_beta._trainable = False
did the trick :)
Thanks!
I have a regression task and am measuring fit using Euclidean distance. Instead of displaying the mean squared error as the loss, I want to display the sum of squares. That is, I want to only sum over the square error terms and not divide by the number of examples.
On a batch level I can achieve this by defining a custom loss like so (maybe I could instead use tf.keras.losses.MeanSquaredError directly):
class CustomLoss(tf.keras.losses.Loss):
    def call(self, Y_true, Y_pred):
        return tf.reduce_sum(tf.math.abs(Y_true - Y_pred) ** 2, axis=-1)

target_loss = CustomLoss(reduction=tf.keras.losses.Reduction.SUM)
This computes the squared error for each example and then instructs TensorFlow to SUM over the examples to compute the batch loss, instead of the default SUM_OVER_BATCH_SIZE (which should not be read literally, but as a fraction, i.e. SUM / BATCH_SIZE).
My problem is that, on an epoch level, Keras takes these sums and then computes the mean across steps (batches) to report the epoch loss. How do I get Keras to compute the sum over batches instead of the mean?
You will have to write a custom callback which appends the loss after each batch to a list (as shown in the linked doc).
Implement on_epoch_end to get the sum of all the values in that list (where you added all the batch losses).
If you want to minimize the sum of losses over all the batches, use the K.function API (see the linked full implementation). A minimal sketch of the callback approach follows.
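A minimal sketch of such a callback (one caveat worth checking: in some Keras versions the batch-level 'loss' log entry is a running average over the epoch rather than the raw per-batch loss):
import tensorflow as tf

class EpochLossSum(tf.keras.callbacks.Callback):
    """Collects the per-batch losses and reports their sum at the end of each epoch."""

    def on_epoch_begin(self, epoch, logs=None):
        self.batch_losses = []

    def on_train_batch_end(self, batch, logs=None):
        # append the batch loss reported by Keras
        self.batch_losses.append(logs["loss"])

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: sum over batches = {sum(self.batch_losses):.4f}")

Pass it to fit() via callbacks=[EpochLossSum()].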
You can sum over the batches in a tf.keras.metrics.Metric like below, but right now there is an open issue pending in 2.4.x (please see this GitHub issue); you can try with 2.3.2 though.
class AddAllOnes(tf.keras.metrics.Metric):
    """A simple metric that adds up all the ones in the current batch and is supposed
    to return the total number of ones seen at the end of every batch."""

    def __init__(self, name="add_all_ones", **kwargs):
        super(AddAllOnes, self).__init__(name=name, **kwargs)
        self.total = self.add_weight(name="total", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        self.total.assign_add(tf.cast(tf.reduce_sum(y_true), dtype=tf.float32))

    def result(self):
        print('')
        print('inside result...', self.total)
        return self.total
X_train = np.random.random((512, 8))
y_train = np.random.randint(0, 2, (512, 1))
K.clear_session()
model_inputs = Input(shape=(8,))
model_unit = Dense(256, activation='linear', use_bias=False)(model_inputs)
model_unit = BatchNormalization()(model_unit)
model_unit = Activation('sigmoid')(model_unit)
model_outputs = Dense(1, activation='sigmoid')(model_unit)
optim = Adam(learning_rate=0.001)
model = Model(inputs=model_inputs, outputs=model_outputs)
model.compile(loss='binary_crossentropy', optimizer=optim, metrics=[AddAllOnes()], run_eagerly=True)
model.fit(X_train, y_train, verbose=1, batch_size=32)
Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., updating the weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True), instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
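For example, a minimal sketch of the two variants:
# default: gradients are overwritten with zero tensors
optimizer.zero_grad()

# since v1.7.0: gradients are reset to None instead (lower memory, slightly faster,
# but any code that reads p.grad immediately afterwards must handle None)
optimizer.zero_grad(set_to_none=True)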
Although the idea can be derived from the chosen answer, I feel like I want to write it out explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom in how the gradient is accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit on the GPU.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, followed by loss.backward(). This is limited by the GPU memory.
gradient_accumulation_steps determines the actual training batch size, where the loss from multiple forward passes is accumulated. This is NOT limited by the GPU memory.
From this example, you can see how optimizer.zero_grad() may be followed by optimizer.step() but NOT by loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated training batches equals gradient_accumulation_steps (line 227, inside the if block at line 219); a sketch of this pattern is given after the link below.
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
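A minimal sketch of that accumulation pattern (model, dataloader, loss_fn and optimizer are assumed to already exist; gradient_accumulation_steps plays the role described above):
gradient_accumulation_steps = 4  # effective batch = 4 forward passes

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y)
    # scale so the accumulated gradient matches the average over the large batch
    (loss / gradient_accumulation_steps).backward()

    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # clear them for the next large batch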
Also, someone is asking about an equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.
(I am still new to AI libraries, please correct me if anything I said is wrong.)
zero_grad() restarts the loop without the losses from the last step if you use the gradient method for decreasing the error (or losses).
If you do not use zero_grad(), the loss will increase, not decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call zero_grad(); alternatively, you can decay the gradients, for example:
optimizer = some_pytorch_optimizer

# decay the grads:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            p.grad = p.grad / 2
This way the learning is much more continuous.
During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs). When we start backpropagation, we want to update the weights in order to minimize the cost function. So we clear off the previously accumulated gradients in order to obtain better weights. We keep doing this during training, and we do not perform it during testing because we already have the weights, obtained at training time, that best fit our data. Hope this makes it clearer!
In simple terms, we need zero_grad()
because when we start a training loop we do not want past gradients or past results to interfere with our current results. This is due to how PyTorch works: it collects/accumulates the gradients during backpropagation, and if past results get mixed in they may give us the wrong results. So we set the gradients to zero every time we go through the loop.
Here is an example:
# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()
    y_pred = model_1(X_train)
    loss = loss_fn(y_pred, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In this for loop, if we do not zero the gradients on every pass, the past values may get added up and change the result.
So we use zero_grad() to avoid facing wrong accumulated results.