How do I mutate the input using gradient descent in PyTorch? - python

I'm new to PyTorch. I learned it uses autograd to automatically calculate the gradients for the gradient descent function.
Instead of adjusting the weights, I would like to mutate the input to achieve a desired output, using gradient descent. So, instead of the weights of neurons changing, I want to keep all of the weights the same and just change the input to minimize the loss.
For example. The network is a trained image classifier with the numbers 0-9. I input random noise, and I want to morph it so that the network considers it a 3 with 60% confidence. I would like to utilize gradient descent to adjust the values of the input (originally noise) until the network considers the input to be a 3, with 60% confidence.
Is there a way to do this?

I assume you know how to do regular training with gradient descent. You only need to change the parameters to be optimized by the optimizer. Something like
# ... Setup your network, load the input
# ...
# Set proper requires_grad -> We train the input, not the parameters
input.requires_grad = True
for p in network.parameters():
p.requires_grad = False
# Setup the optimizer
# Previously we should have SomeOptimizer(net.parameters())
optim = SomeOptimizer([input])
output_that_you_want = ...
actual_output = net(input)
some_loss = SomeLossFunction(output_that_you_want, actual_output)
# ...
# Back-prop and optim.step() as usual

Related

How can I pass the gradient of a tensor as a parameter in the loss function in PyTorch?

I am trying to implement a neural network in PyTorch to solve an ordinary differential equation (ODE). The network architecture is straight-forward. It is just a feed-forward neural network with n inputs and outputs and k layers.
class PINN(torch.nn.Module):
def __init__(self,n):
super().__init__()
# Layers
self.L1=torch.nn.Linear(1,n)
self.L2=torch.nn.Linear(n,n)
self.L3=torch.nn.Linear(n,1)
# Activation functions
self.t=torch.nn.Tanh()
self.r=torch.nn.ReLU()
def forward(self,x):
a1=self.r(self.L1(x))
a2=self.r(self.L2(a1))
a3=self.r(self.L3(a2))+x
return a3
I want to minimize the loss between the gradient of the output of my neural network and the right-hand side of the ODE. I have chosen to work with a mean-squared error loss. I know that PyTorch includes a built-in MSE loss. However, I defined my own loss function since I have to pass in the gradient of a tensor.
def ODELoss(x,y,x0):
# Number of collocation points to sample
n=len(x)
# Initialize the loss to zero
loss=torch.tensor(0.,requires_grad=True)
# Loop over the "data". Technically, this is an unsupervised problem.
# The "data" are points sampled on the domain which are then evaluated
# according to the ODE.
for (xx,yy) in zip(x,y):
xx=torch.tensor([[xx]],requires_grad=True)
yy=torch.tensor([[yy]])
g(xx,x0).backward()
dg=xx.grad.clone().requires_grad_(True)
loss=loss+(dg-yy)**2
loss=loss/n
return loss
Here, g(x) is called the universal predictor. It is used in the literature to account for the initial condition(s).
def g(x,x0):
return x*model(x)+x0
This doesn't seem to work because it seems like I am not passing in the gradient of the output correctly. Can anyone give me some guidance on how to do this?

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolvded2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training =np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
shape = [size_image,size_image,1,1],\
dtype = tf.float64,\
initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(nb_epochs):
minibatch_cost = 0.
seed = seed + 1
permutation = list(np.random.permutation(n_train))
shuffled_ind_samples = np.array(ind_samples_training)[permutation]
# Learning rate update
if learning_rate.shape[0]>1:
lr = learning_rate[epoch]
nb_minibatches = int(np.ceil(n_train/minibatch_size))
for num_minibatch in range(nb_minibatches):
# Minibatch indices
ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
# Loading of the original image (X) and the blurred image (Y)
minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
_ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the xavier initiliazer method provided by tensorflow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt results too different from the true gaussian one as it can be seen on the two images available at https://github.com/megalinier/Helsinki-project.
By examining the photos it seems like the network is learning OK, as the predicted image is not so far off the true label - for better results you can tweak some hyperparams but that is not the case.
I think what you are missing is the fact that different kernels can get quite similar results since it is a convolution.
Think about it, you are multiplying some matrix with another, and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be a results of 2.5 + 2.5 + 2.5 + 2.5 and -10 + 10 + 10 + 0.
What I am trying to say, is that your network could be learning just fine, but you will get a different values in the conv kernel than the filter.
I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong but there could be multiple culprits here. For one, squared error provides a weak signal in the case that target and prediction are already quite similar -- and while the xavier-initalized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

Incorporate side conditions into Keras neural network

I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less or equal 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invarient (norm <= 1) to a penalty function, which is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0)
total_loss = mse + norm_loss
Look at the docs of where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0. No gradient is produced.
But this can be very hard to optimize. Your predictions could oscillate around a norm of 1. It is also possible to add a factor: total_loss = mse + 1000* norm_loss. Be very careful with this, it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / np.linalg.norm(y_target)

How is embedding matrix being trained in this code snippet?

I'm following the code of a coursera assignment which implements a NER tagger using a bidirectional LSTM.
But I'm not able to understand how the embedding matrix is being updated. In the following code, build_layers has a variable embedding_matrix_variable which acts an input the the LSTM. However it's not getting updated anywhere.
Can you help me understand how embeddings are being trained?
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)
forward_cell = tf.nn.rnn_cell.DropoutWrapper(
tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
input_keep_prob=self.dropout_ph,
output_keep_prob=self.dropout_ph,
state_keep_prob=self.dropout_ph
)
backward_cell = tf.nn.rnn_cell.DropoutWrapper(
tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn, forget_bias=3.0),
input_keep_prob=self.dropout_ph,
output_keep_prob=self.dropout_ph,
state_keep_prob=self.dropout_ph
)
embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
(rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
cell_fw=forward_cell, cell_bw=backward_cell,
dtype=tf.float32,
inputs=embeddings,
sequence_length=self.lengths
)
rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)
self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)
def compute_loss(self, n_tags, PAD_index):
"""Computes masked cross-entopy loss with logits."""
ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
self.loss = tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
In TensorFlow, variables are not usually updated directly (i.e. by manually setting them to a certain value), but rather they are trained using an optimization algorithm and automatic differentiation.
When you define a tf.Variable, you are adding a node (that maintains a state) to the computational graph. At training time, if the loss node depends on the state of the variable that you defined, TensorFlow will compute the gradient of the loss function with respect to that variable by automatically following the chain rule through the computational graph. Then, the optimization algorithm will make use of the computed gradients to update the values of the trainable variables that took part in the computation of the loss.
Concretely, the code that you provide builds a TensorFlow graph in which the loss self.loss depends on the weights in embedding_matrix_variable (i.e. there is a path between these nodes in the graph), so TensorFlow will compute the gradient with respect to this variable, and the optimizer will update its values when minimizing the loss. It might be useful to inspect the TensorFlow graph using TensorBoard.

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training?
| zero_grad(self)
| Sets gradients of all model parameters to zero.
In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting to do backpropragation (i.e., updating the Weights and biases) because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.
Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).
Here is a simple example:
import torch
from torch.autograd import Variable
import torch.optim as optim
def linear_model(x, W, b):
return torch.matmul(x, W) + b
data, targets = ...
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
optimizer = optim.Adam([W, b])
for sample, target in zip(data, targets):
# clear out the gradients of all Variables
# in this optimizer (i.e. W, b)
optimizer.zero_grad()
output = linear_model(sample, W, b)
loss = (output - target) ** 2
loss.backward()
optimizer.step()
Alternatively, if you're doing a vanilla gradient descent, then:
W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)
for sample, target in zip(data, targets):
# clear out the gradients of Variables
# (i.e. W, b)
W.grad.data.zero_()
b.grad.data.zero_()
output = linear_model(sample, W, b)
loss = (output - target) ** 2
loss.backward()
W -= learning_rate * W.grad.data
b -= learning_rate * b.grad.data
Note:
The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, Pytorch offers the option to reset the gradients to None optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but might be error-prone if not handled carefully.
Although the idea can be derived from the chosen answer, but I feel like I want to write that explicitly.
Being able to decide when to call optimizer.zero_grad() and optimizer.step() provides more freedom on how gradient is accumulated and applied by the optimizer in the training loop. This is crucial when the model or input data is big and one actual training batch do not fit in to the gpu card.
Here in this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.
train_batch_size is the batch size for the forward pass, following the loss.backward(). This is limited by the gpu memory.
gradient_accumulation_steps is the actual training batch size, where loss from multiple forward pass is accumulated. This is NOT limited by the gpu memory.
From this example, you can see how optimizer.zero_grad() may followed by optimizer.step() but NOT loss.backward(). loss.backward() is invoked in every single iteration (line 216) but optimizer.zero_grad() and optimizer.step() is only invoked when the number of accumulated train batch equals the gradient_accumulation_steps (line 227 inside the if block in line 219)
https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
Also someone is asking about equivalent method in TensorFlow. I guess tf.GradientTape serve the same purpose.
(I am still new to AI library, please correct me if anything I said is wrong)
zero_grad() restarts looping without losses from the last step if you use the gradient method for decreasing the error (or losses).
If you do not use zero_grad() the loss will increase not decrease as required.
For example:
If you use zero_grad() you will get the following output:
model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2
If you do not use zero_grad() you will get the following output:
model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
You don't have to call grad_zero() alternatively one can decay the gradients for example:
optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
for p in group['params']:
if p.grad is not None:
''' original code from git:
if set_to_none:
p.grad = None
else:
if p.grad.grad_fn is not None:
p.grad.detach_()
else:
p.grad.requires_grad_(False)
p.grad.zero_()
'''
p.grad = p.grad / 2
this way the learning is much more continues
During the feed forward propagation the weights are assigned to inputs and after the 1st iteration the weights are initialized what the model has learnt seeing the samples(inputs). And when we start back propagation we want to update weights in order to get minimum loss of our cost function. So we clear off our previous weights in order to obtained more better weights. This we keep doing in training and we do not perform this in testing because we have got the weights in training time which is best fitted in our data. Hope this would clear more!
In simple terms We need ZERO_GRAD
because when we start a training loop we do not want past gardients or past results to interfere with our current results beacuse how PyTorch works as it collects/accumulates the gradients on backpropagation and if the past results may mixup and give us the wrong results so we set the gradient to zero every time we go through the loop.
Here is a example:
`
# let us write a training loop
torch.manual_seed(42)
epochs = 200
for epoch in range(epochs):
model_1.train()
y_pred = model_1(X_train)
loss = loss_fn(y_pred,y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
`
In this for loop if we do not set the optimizer to zero every time the past value it may get add up and changes the result.
So we use zero_grad to not face the wrong accumulated results.

Categories

Resources