I'm trying to implement a linear classifier in PyTorch, using one layer with tensors W and b, softmax, and cross-entropy loss. For each batch I have to:
Calculate logits
Transform logits to probabilities with softmax
Compute most probable classes
Calculate cross entropy between true and predicted classes
Use an optimizer to change W and b
So far I have (I have flat MNIST loaded with Scikit-learn):
# convert Numpy arrays to PyTorch tensor Variables
input_X_train = torch.from_numpy(X_train_flat).float().to(device)
input_X_val = torch.from_numpy(X_val_flat).float().to(device)
input_X_test = torch.from_numpy(X_test_flat).float().to(device)
input_y_train = torch.from_numpy(y_train).long().to(device)
input_y_val = torch.from_numpy(y_val).long().to(device)
input_y_test = torch.from_numpy(y_test).long().to(device)
# model parameters: W and b
W = torch.randn(input_dim, output_dim, device=device, dtype=dtype, requires_grad=True)
b = torch.randn(1, device=device, dtype=dtype, requires_grad=True)
BATCH_SIZE = 512
EPOCHS = 40
LEARNING_RATE = 1e-6
# create torch.optim.Adam optimizer for loss function minimization
optimizer = torch.optim.Adam([W, b], lr=LEARNING_RATE)
# create negative log loss function object for loss function evaluation
# use mean loss value from all batch samples
loss_fn = torch.nn.NLLLoss(reduction="mean")
for t in range(EPOCHS):
    # logits for input_X, resulting shape should be [input_X.shape[0], 10]
    logits = torch.matmul(input_X_train, W) + b
    # apply torch.nn.functional.softmax (torch_f.softmax) to logits
    probas = torch_f.softmax(logits, dim=1)
    # apply torch.argmax to find a class index with highest probability
    classes = torch.argmax(probas, dim=1)
    # loss should be a scalar number: average loss over all the objects with torch.mean()
    # PyTorch implements negative log loss (NLL) *without* log - you have to first compute log of
    # softmax, then negative log loss, which will swap sign
    # Use torch.nn.functional.log_softmax (torch_f.log_softmax) on top of input_y and logits
    # It is identical to calculating cross-entropy (log and then NLL) on top of probas,
    # but is more numerically friendly (read the docs).
    log_probas = torch_f.log_softmax(logits, dim=1)
    loss = loss_fn(log_probas, input_y_train)
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    # calculate backward gradients for backpropagation
    loss.backward()
    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
For some reason, W and b don't change. What am I doing wrong?
EDIT:
I've seen, and tried in the code above, the approach from e.g. this minimal working example: https://discuss.pytorch.org/t/minimal-working-example-of-optim-sgd/11623/2.
EDIT 2:
Gradients W.grad are often zero, which I think should not happen. The class probabilities are definitely right (so it's not e.g. like this example): I've checked the sum of every row, and the probabilities of all classes for each sample sum to 1.
I am replicating a paper. I have a basic Keras CNN model for MNIST classification. For each training sample z, I want to calculate the Hessian of that sample's loss with respect to the model parameters, and then average this Hessian over the training data (n is the number of training samples).
My final goal is to calculate this value (the influence score):
I can calculate the left term and the right term, and want to compute the Hessian term. I don't know how to calculate the Hessian with respect to the model weights for a batch of examples (vectorization). I was only able to calculate it for one sample at a time, which is too slow.
x = tf.convert_to_tensor(x_train[0:13])
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = model(x)
        mce = tf.keras.losses.CategoricalCrossentropy()
        y_expanded = y_train[train_idx]
        loss = mce(y_expanded, y)
    g = t1.gradient(loss, model.weights[4])
h = t2.jacobian(g, model.weights[4])
print(h.shape)
For clarification, if a model layer has dimension 20*30, I want to feed it a batch of 13 samples and get a Hessian of dimension (13,20,30,20,30). With the code above I can only get a Hessian of dimension (20,30,20,30), which thwarts the vectorization.
This thread has the same problem, except that I want the second-order derivative rather than the first-order.
I also tried the script below, which returns a (13,20,30,20,30) tensor, so the dimensions are right. However, when I compared the sum of this tensor with the sum of 13 single-sample Hessians computed in a for loop from 0 to 12, the numbers differed. I expected equal values, so this does not work either.
x = tf.convert_to_tensor(x_train[0:13])
mce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        t1.watch(model.weights[4])
        y_expanded = y_train[0:13]
        y = model(x)
        loss = mce(y_expanded, y)
    j1 = t1.jacobian(loss, model.weights[4])
j3 = t2.jacobian(j1, model.weights[4])
print(j3.shape)
That follows from how Hessians are defined: you can only compute the Hessian of a scalar function.
This is nothing new. The same holds for gradients, and batches are handled there by accumulating the per-example gradients. Something similar can be done with the Hessian.
If you know how to compute the Hessian of the loss, you can define a batch cost and compute its Hessian with the same method. For example, you could define your cost as sum(losses), where losses is the vector of losses for all examples in the batch.
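To make the sum(losses) idea concrete, here is a minimal sketch in plain Python (no TensorFlow, so it runs anywhere). It assumes a toy per-example loss 0.5 * (w.x - y)**2, whose exact Hessian is the outer product of x with itself. The data and all function names are made up for illustration. It checks that the Hessian of the single scalar batch cost equals the sum of the per-example Hessians.

```python
def per_example_hessian(x):
    """Exact Hessian of 0.5 * (w.x - y)**2 w.r.t. w: the outer product x x^T."""
    return [[xa * xb for xb in x] for xa in x]


def batch_loss(w, xs, ys):
    """Scalar batch cost: the sum of the per-example losses."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += 0.5 * (pred - y) ** 2
    return total


def numerical_hessian(f, w, eps=1e-3):
    """Central finite-difference Hessian of a scalar function f at w."""
    n = len(w)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def f_shifted(di, dj):
                v = list(w)
                v[i] += di
                v[j] += dj
                return f(v)
            hess[i][j] = (f_shifted(eps, eps) - f_shifted(eps, -eps)
                          - f_shifted(-eps, eps) + f_shifted(-eps, -eps)) / (4 * eps * eps)
    return hess


# Hypothetical toy batch: three examples, two weights.
xs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
ys = [1.0, 0.0, 2.0]
w = [0.3, -0.7]

# Sum of the per-example Hessians ...
summed = [[sum(per_example_hessian(x)[i][j] for x in xs) for j in range(2)]
          for i in range(2)]
# ... versus the Hessian of the single scalar batch cost.
of_sum = numerical_hessian(lambda v: batch_loss(v, xs, ys), w)
print(summed)
print(of_sum)
```

Dividing either result by the number of examples gives the averaged Hessian. If the individual per-example Hessians are needed, they still have to be computed one example at a time (or with a vectorized per-example mechanism, if the framework offers one).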
Suppose you have a model and want to train its weights using the Hessian of the loss on the training images w.r.t. the trainable weights.
# Import the libraries we need
import tensorflow as tf
from tensorflow.python.eager import forwardprop
model = tf.keras.models.load_model('model.h5')
# Define the Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
# Define the loss function
def loss_function(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
# Define the accuracy metric function
def accuracy_function(y_true, y_pred):
    return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
Now, define the variables for storing the mean of the loss and accuracy
train_loss = tf.keras.metrics.Mean(name='loss')
train_accuracy = tf.keras.metrics.Mean(name='accuracy')
# Now compute the Hessian in some different style for better efficiency of the model
vector = [tf.ones_like(v) for v in model.trainable_variables]
def _forward_over_back_hvp(images, labels):
    with forwardprop.ForwardAccumulator(model.trainable_variables, vector) as acc:
        with tf.GradientTape() as grad_tape:
            logits = model(images, training=True)
            loss = loss_function(labels, logits)
        grads = grad_tape.gradient(loss, model.trainable_variables)
    hessian = acc.jvp(grads)
    optimizer.apply_gradients(zip(hessian, model.trainable_variables))
    train_loss(loss)  # keep adding the loss
    train_accuracy(accuracy_function(labels, logits))  # keep adding the accuracy
# Now, here we need to call the function and train it
import time
for epoch in range(20):
    start = time.time()
    train_loss.reset_states()
    train_accuracy.reset_states()
    for i, (x, y) in enumerate(dataset):
        _forward_over_back_hvp(x, y)
        if i % 50 == 0:
            print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')
    print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
Epoch 1 Loss 2.6396 Accuracy 0.1250
Time taken for 1 epoch: 0.23 secs
output_1, output_2 = model(x)
loss = cross_entropy_loss(output_1, target_1)
loss.backward()
optimizer.step()
loss = cross_entropy_loss(output_2, target_2)
loss.backward()
optimizer.step()
However, when I run this piece of code, I got this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 4]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
So, what am I supposed to do to train a model with two or more outputs?
The entire premise on which PyTorch (and other DL frameworks) is founded is the backpropagation of the gradients of a scalar loss function.
In your case, you have a vector (of dim=2) loss function:
[cross_entropy_loss(output_1, target_1), cross_entropy_loss(output_2, target_2)]
You need to decide how to combine these two losses into a single scalar loss.
For instance:
weight = 0.5 # relative weight
loss = weight * cross_entropy_loss(output_1, target_1) + (1. - weight) * cross_entropy_loss(output_2, target_2)
# now loss is a scalar
loss.backward()
optimizer.step()
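The same pattern can be sketched without PyTorch. The plain-Python toy below is only an illustration under assumptions: the two quadratic losses, the shared parameter w, and the learning rate are all made up. Each iteration combines both losses into one scalar, takes one gradient, and makes one update, instead of running a backward pass and an optimizer step per loss.

```python
def loss_1(w):
    # Hypothetical loss for the first output; minimized at w = 1.
    return (w - 1.0) ** 2


def loss_2(w):
    # Hypothetical loss for the second output; minimized at w = 3.
    return (w - 3.0) ** 2


def combined_grad(w, weight):
    # Analytic derivative of weight * loss_1(w) + (1 - weight) * loss_2(w).
    return weight * 2.0 * (w - 1.0) + (1.0 - weight) * 2.0 * (w - 3.0)


weight = 0.5  # relative weight of the two losses
lr = 0.1      # learning rate
w = 0.0       # shared parameter

for _ in range(200):
    # One scalar objective, one gradient, one update per iteration.
    w -= lr * combined_grad(w, weight)

final_loss = weight * loss_1(w) + (1.0 - weight) * loss_2(w)
print(w, final_loss)
```

With equal weights the shared parameter settles halfway between the two individual minima. Changing weight moves that compromise towards one loss or the other.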
In TF2 keras, I have trained an Autoencoder using tensorflow.keras.losses.MeanSquaredError as loss function. Now, I want to further train this model by using another loss function, specifically tensorflow.keras.losses.KLDivergence. The reason for this is that initially unsupervised learning is conducted for representation learning. Then, having the generated embeddings, I can cluster them and use these clusters for self-supervision, i.e. labels, enabling the second, supervised loss and improving the model further.
This is not transfer learning per se, as no new layers are added to the model, just the loss function is changed and the model continues training.
What I have tried is using the pretrained model with the MSE loss as the new model's property:
class ClusterBooster(tf.keras.Model):
    def __init__(self, base_model, centers):
        super(ClusterBooster, self).__init__()
        self.pretrained = base_model
        self.centers = centers
    def train_step(self, data):
        with tf.GradientTape() as tape:
            loss = self.compiled_loss(self.P, self.Q, regularization_losses=self.losses)
        # Compute gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return {m.name: m.result() for m in self.metrics}
where the loss is the KL loss between distributions P and Q. The distributions are computed in a callback function instead of the model train_step as I need access to the current epoch (P is updated every 5 epochs, not on each epoch):
def on_epoch_begin(self, epoch, logs=None):
    z = self.model.pretrained.embed(self.feature, training=True)
    z = tf.reshape(z, [tf.shape(z)[0], 1, tf.shape(z)[1]])  # reshape for broadcasting
    # CALCULATE Q FOR EVERY EPOCH
    partial = tf.math.pow(tf.norm(z - self.model.centers, axis=2, ord='euclidean'), 2)
    nominator = 1 / (1 + partial)
    denominator = tf.math.reduce_sum(1 / (1 + partial))
    self.model.Q = nominator / denominator
    # CALCULATE P EVERY 5 EPOCHS TO AVOID INSTABILITY
    if epoch % 5 == 0:
        partial = tf.math.pow(self.model.Q, 2) / tf.math.reduce_sum(self.model.Q, axis=1, keepdims=True)
        nominator = partial
        denominator = tf.math.reduce_sum(partial, axis=0)
        self.model.P = nominator / denominator
However, when apply_gradients() is executed I get:
ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0']
I think that this is due to the fact that the pretrained model is not set to be further trained somewhere inside the new model (only the embed() method is called, which does not train the model). Is this a correct approach and I am just missing something or is there a better way?
It seems that whatever computation takes place in a callback isn't tracked for gradient computation and weight updating. These computations should therefore be put inside the train_step() function of the custom Model class (ClusterBooster), within the GradientTape context.
Since I don't have access to the current epoch number inside the train_step() function of ClusterBooster, I wrote a custom training loop without a Model class, where I could use plain Python code (which is executed eagerly).
I'm looking at the policy gradients sample in this notebook: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb
The relevant code is here:
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
# Run training over a bunch of instances of inputs
for step in range(n_max_steps):
    action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})
    ...
# Then weight each gradient by the action values, average, and feed them back into training_op to apply_gradients()
The above works fine, as each run() returns different gradients.
I'd like to batch all this and feed an array of inputs into run() instead of one input at a time (my environment is different from the one in the sample, so it makes sense for me to batch and improve performance). I.e.:
action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs_array})
Where obs_array has shape [n_instances, n_inputs].
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1]. action_val does return a 1d tensor of actions, as expected - one action per instance in the batch.
Is there any way for me to get an array of gradients, one per instance in the batch?
The problem is that optimizer.compute_gradients(cross_entropy) seems to return a single gradient, even though cross_entropy is a 1d tensor of shape [None, 1].
That happens by design, as the gradient terms for each tensor are automatically aggregated. Gradient computation operations such as optimizer.compute_gradients and the low-level primitive tf.gradients make a sum of all gradient operations, according to the default AddN aggregation method. This is fine for most cases of stochastic gradient descent.
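One consequence of this summation is worth spelling out: because differentiation is linear, weighting each per-instance loss before summing gives the same result as weighting each per-instance gradient afterwards. The plain-Python sketch below checks this for a toy one-parameter loss. The loss, the data, and the per-instance weights (stand-ins for the action scores) are all assumptions made for illustration and are not taken from the notebook.

```python
def instance_loss(w, x, y):
    # Hypothetical per-instance loss with a single parameter w.
    return 0.5 * (w * x - y) ** 2


def instance_grad(w, x, y):
    # Analytic derivative of instance_loss with respect to w.
    return (w * x - y) * x


def aggregated_grad(w, xs, ys, weights, eps=1e-6):
    """Finite-difference gradient of the weighted, summed loss."""
    def total(v):
        return sum(a * instance_loss(v, x, y)
                   for a, x, y in zip(weights, xs, ys))
    return (total(w + eps) - total(w - eps)) / (2 * eps)


xs = [1.0, -2.0, 0.5]
ys = [0.0, 1.0, 2.0]
weights = [1.5, -0.5, 2.0]  # e.g. one score per instance
w = 0.8

# Route 1: compute each instance's gradient, then weight and sum them.
per_instance = [instance_grad(w, x, y) for x, y in zip(xs, ys)]
weighted_sum = sum(a * g for a, g in zip(weights, per_instance))

# Route 2: weight the losses, sum them, and take a single gradient.
from_summed_loss = aggregated_grad(w, xs, ys, weights)
print(weighted_sum, from_summed_loss)
```

Whether the second route fits a given setup depends on when the weights become available. If they are only known after the gradients have been requested, the per-instance gradients are still needed.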
In the end, unfortunately, the gradient computation will have to be made over a single batch, unless a custom gradient function is built or the TensorFlow API is extended to provide gradient computation without full aggregation. Changing the implementation of tf.gradients to do this does not seem trivial.
One trick that you might wish to employ for your reinforcement learning model is to perform multiple session runs in parallel. According to the FAQ, the Session API supports multiple concurrent steps, and will take advantage of the existing resources for parallel computation. The question Asynchronous computation in TensorFlow shows how to do this.
One weak solution I came up with is to create an array of gradient operations, one per instance in the batch, which I can then run all at the same time:
X = tf.placeholder(tf.float32, shape=[minibatch_size, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
hidden2 = tf.layers.dense(hidden, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden2, n_outputs)
outputs = tf.nn.sigmoid(logits) # probability of action 0
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Calculate gradients per batch instance - for minibatch training
batch_gradients = []
for instance_cross_entropy in tf.unstack(cross_entropy):
    instance_grads_and_vars = optimizer.compute_gradients(instance_cross_entropy)
    instance_gradients = [grad for grad, variable in instance_grads_and_vars]
    batch_gradients.append(instance_gradients)
# Calculate gradients for just one instance - for single instance training
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
# Create gradient placeholders
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
# In the end we only apply a single set of averaged gradients
training_op = optimizer.apply_gradients(grads_and_vars_feed)
...
while step < len(obs_array) - minibatch_size:
    action_array, batch_gradients_array = sess.run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]})
    for action_val, gradient in zip(action_array, batch_gradients_array):
        action_vals.append(action_val)
        current_gradients.append(gradient)
    step += minibatch_size
The main points are that I need to specify the batch size for placeholder X; I can't leave it open-ended, otherwise unstack has no idea how many elements to unstack. I unstack cross_entropy to get the cross-entropy per instance, then I call compute_gradients per instance. During training I run([action, batch_gradients], feed_dict={X: obs_array[step:step+minibatch_size]}), which gives me a separate gradient for each instance in the batch.
This is all well and good, but it doesn't give me much of a performance boost. I only get a max speedup of 2x. Increasing the batch size past 5 just scales the runtime of run() linearly, and gives no gain.
It's sad that TensorFlow can calculate and aggregate gradients over hundreds of instances blazingly fast, but requesting the gradients one by one is so much slower. Might need to dig into the source next...
Considering the example code.
I would like to know how to apply gradient clipping to the RNN in this network, where there is a possibility of exploding gradients.
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
This is an example that could be used, but where do I introduce it?
In the definition of the RNN:
lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
_X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)
But this doesn't make sense, as the tensor _X is the input and not the gradient that is to be clipped.
Do I have to define my own Optimizer for this or is there a simpler option?
Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section in TensorFlow's API documentation. Specifically you'll need to substitute the call to the minimize() method with something like the following:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
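To illustrate what clipping by global norm does, here is a minimal plain-Python sketch. It is not TensorFlow's implementation; the function name and the toy gradients are assumptions for illustration. All gradient tensors (flat lists here) are scaled by one shared factor, so their relative scale is preserved.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by one shared factor so that the joint L2 norm
    of all of them is at most clip_norm.

    Returns (clipped_grads, global_norm_before_clipping).
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        return [list(grad) for grad in grads], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm


# Two hypothetical gradient "tensors"; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, 5.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after, clipped)
```

The ratio between any two entries is the same before and after clipping, so the update direction is unchanged and only its length shrinks.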
Clipping each gradient matrix individually changes their relative scale but is also possible:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:
optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
    loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
It's easy for tf.keras!
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
This optimizer will clip all gradients to values between [-1.0, 1.0].
See the docs.
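For intuition, clipping by value amounts to the following element-wise operation. This is a plain-Python sketch with a made-up function name and toy numbers, not the Keras implementation.

```python
def clip_by_value(grad, low, high):
    """Clamp every gradient entry independently to the range [low, high]."""
    return [max(low, min(high, g)) for g in grad]


grad = [3.0, -0.2, -7.5]  # hypothetical gradient entries
clipped = clip_by_value(grad, -1.0, 1.0)
print(clipped)
```

Because each entry is clamped on its own, large entries shrink while small ones stay put, so the direction of the gradient can change. Norm-based clipping rescales all entries by the same factor instead.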
This is actually properly explained in the documentation:
Calling minimize() takes care of both computing the gradients and
applying them to the variables. If you want to process the gradients
before applying them you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients().
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
And in the example they provide they use these 3 steps:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
For those who would like to understand the idea of gradient clipping (by norm):
Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.
Let the gradient be g and the max_norm_threshold be j.
Now, if ||g|| > j , we do:
g = ( j * g ) / ||g||
This is the implementation done in tf.clip_by_norm
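The rule above can be written out in a few lines of plain Python. This is a sketch of the formula only, with g as a flat list and j as the threshold; the function name is chosen for illustration and this is not the TensorFlow source.

```python
import math


def clip_by_norm(g, j):
    """If ||g|| > j, return j * g / ||g||; otherwise return g unchanged."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm > j:
        return [j * x / norm for x in g]
    return list(g)


large = clip_by_norm([6.0, 8.0], 5.0)   # ||g|| = 10 exceeds the threshold
small = clip_by_norm([0.3, 0.4], 5.0)   # ||g|| = 0.5 is left untouched
print(large, small)
```

The clipped gradient keeps its direction, and its norm equals the threshold exactly when clipping is triggered.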
IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm (TensorFlow 1.x only, since tf.contrib was removed in TensorFlow 2):
original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)
This way you only have to define this once, and not run it after every gradients calculation.
Documentation:
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm
Gradient clipping basically helps in the case of exploding gradients. Say your loss is too high: this results in very large gradients flowing through the network, which may produce NaN values. To overcome this, we clip the gradients to a specific range (-1 to 1, or any range as the situation requires).
clipped_value = [(tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars]
where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients) and the variables they will be applied to.
After clipping we simply apply its value using an optimizer.
optimizer.apply_gradients(clipped_value)
Method 1
If you are training your model with a custom training loop, one update step looks like this:
'''
for loop over full dataset
X -> training samples
y -> labels
'''
optimizer = tf.keras.optimizers.Adam()
for x, y in train_Data:
    with tf.GradientTape() as tape:
        prob = model(x, training=True)
        # calculate loss
        train_loss_value = loss_fn(y, prob)
    # get gradients
    gradients = tape.gradient(train_loss_value, model.trainable_weights)
    # option A: clip gradients by norm
    gradients = [(tf.clip_by_norm(grad, clip_norm=1.0)) for grad in gradients]
    # option B: clip gradients by value (normally you keep only one of A and B)
    gradients = [(tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0)) for grad in gradients]
    # apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
Method 2
Or you could simply replace the first line in the code above as follows:
# for clipping by norm
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# for clipping by value
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)
The second method also works if you are using the model.compile -> model.fit pipeline.