How to calculate Gradient of the loss with respect to input?

I have a pre-trained PyTorch model. I need to calculate the gradient of the loss with respect to the network's inputs using this model (without training again and only using the pre-trained model).
I wrote the following code, but I am not sure whether it is correct.
test_X, test_y = load_data(mode='test')
testset_original = MyDataset(test_X, test_y, transform=default_transform)
testloader = DataLoader(testset_original, batch_size=32, shuffle=True)
model = MyModel(device=device).to(device)
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
gradient_losses = []
for i, data in enumerate(testloader):
    inputs, labels = data
    inputs = inputs.to(device)
    labels = labels.to(device)
    inputs.requires_grad = True  # track gradients with respect to the inputs
    output = model(inputs)
    loss = loss_function(output, labels)
    loss.backward()
    gradient_losses.append(inputs.grad)
My question is: does the list gradient_losses actually store what I wish to store? If not, what is the correct way to do that?

Does the list gradient_losses actually store what I wish to store?
Yes, if you are looking for the derivative of the loss with respect to the input, then that seems to be the correct way to do it. Here is a minimal example: take f(x) = a*x, so that df/dx = a.
>>> x = torch.rand(10, requires_grad=True)
>>> y = torch.rand(10)
>>> a = torch.tensor([3.], requires_grad=True)
>>> loss = a*x - y
>>> loss.mean().backward()
>>> x.grad
tensor([0.3000, 0.3000, ..., 0.3000, 0.3000])
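This is equal to a / len(x), since taking the mean scales each element's contribution by 1 / len(x).
Note that the same applies to a batch: if the loss is averaged over the batch, each row of inputs.grad is the gradient of that sample's own loss scaled by 1 / batch_size (assuming the model treats samples independently), not the unscaled per-sample gradient.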
Also, you don't need to .clone() your input gradients as they are not part of the model and won't get zeroed by model.zero_grad().
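The a / len(x) result can be checked without autograd. The following is only a sketch in plain Python, using a central finite difference on the same mean-reduced loss; the helper names and the sample values are made up for illustration.

```python
# Finite-difference check of d(mean(a*x - y))/dx_i = a / len(x).
# Plain Python; all names and values here are illustrative assumptions.

def loss(x, y, a):
    """Mean of a*x_i - y_i over all elements."""
    return sum(a * xi - yi for xi, yi in zip(x, y)) / len(x)

def numerical_grad(x, y, a, eps=1e-6):
    """Central-difference estimate of the gradient of loss with respect to x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((loss(x_plus, y, a) - loss(x_minus, y, a)) / (2 * eps))
    return grad

a = 3.0
x = [0.1 * i for i in range(10)]
y = [0.05 * i for i in range(10)]

grad = numerical_grad(x, y, a)
expected = a / len(x)
print(grad)
print(expected)
```

Every entry of grad should agree with expected up to floating-point error, which matches the values that x.grad reports in the autograd example.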

Related

How to make predictions on new dataset with tensorflow's gradient tape

While I understand how to use model.fit(x_train, y_train), I can't figure out how to make predictions on new data when using TensorFlow's gradient tape. My GitHub repository with runnable code (up to an error) can be found here. What currently works is that I get the trained model's output, "network_output". However, it appears that with gradient tape, argmax is applied to the result of calling the model, whereas I'm used to the model.fit() workflow, where the test data is passed to the model as an input:
network_output = trained_network(input_images,input_number)
preds = np.argmax(network_output, axis=1)
Where "input_images" is an ndarray: (20,3,3,1) and "input_number" is an ndarray: (20,5).
Now I'm treating network_output as the trained model and would like to use it to predict on similarly typed data, test_images and test_number respectively.
I get the error 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'predict' here:
predicted_number = network_output.predict(test_images)
This is because I don't know how to use the tape to make predictions. Once the prediction works, I would guess I can compare the resulting "predicted_number" against "test_number", as would usually be done with the model.fit method.
acc = 0
for i in range(len(test_images)):
    if predicted_number[i] == test_number[i]:
        acc += 1
# divide by the number of test samples that were actually compared
print("Accuracy: ", acc / len(test_images) * 100, "%")

To obtain predictions, I usually iterate through the batches manually like this:
predictions = []
for batch in range(num_batch):
    logits = trained_network(x_test[batch * batch_size: (batch + 1) * batch_size], training=False)
    # first obtain probabilities
    # (if the last layer of the network has no activation, otherwise skip the softmax here)
    prob = tf.nn.softmax(logits)
    # put the predictions for all batches back together
    predictions.extend(tf.argmax(input=prob, axis=1))
If you don't have a lot of data, you can skip the loop. This is faster than using predict because you invoke the model's __call__ method directly:
logits = trained_network(x_test, training=False)
prob = tf.nn.softmax(logits)
predictions = tf.argmax(input=prob, axis=1)
Finally, you could also use predict. In this case the batches are handled automatically. It is easier to use when you have lots of data, since you don't have to write a loop to iterate through the batches. The result is a numpy array of predictions. It can be used like this:
predictions = trained_network.predict(x_test) # you can set a batch_size if you want
What you're doing wrong is this part:
network_output = trained_network(input_images,input_number)
predicted_number = network_output.predict(test_images)
You have to call predict directly on your model trained_network.
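To make the batching and the final comparison concrete, here is a rough sketch in plain Python. The toy_model function is a hypothetical stand-in for the trained network (it simply returns its input rows as class scores), and the data is invented for illustration.

```python
# Batched prediction and accuracy in plain Python.
# `toy_model` is a hypothetical stand-in for the trained network.

def toy_model(batch):
    """Return one row of class scores per input (here: the input itself)."""
    return [list(row) for row in batch]

def argmax(row):
    """Index of the largest score in a row."""
    return max(range(len(row)), key=lambda j: row[j])

def predict_in_batches(model, inputs, batch_size):
    """Run the model batch by batch and collect the predicted class indices."""
    predictions = []
    for start in range(0, len(inputs), batch_size):
        scores = model(inputs[start:start + batch_size])
        predictions.extend(argmax(row) for row in scores)
    return predictions

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

x_test = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4],
          [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
y_test = [1, 0, 2, 0, 0]

preds = predict_in_batches(toy_model, x_test, batch_size=2)
acc = accuracy(preds, y_test)
print(preds, acc)
```

Because softmax is monotonic, taking the argmax of the raw scores gives the same class index as taking it after softmax, so the sketch skips the softmax step.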

Tensorflow : result of training data(using sigmoid) came out reversely

I tried to train my data using 'Gradient Descent Algorithm' to minimize cost value,
and strangely enough, result came out differently depending on the number of steps.
Below is my training code:
import os
import tensorflow as tf
X = tf.placeholder(tf.float32, shape=[None, 2], name="X")
Y = tf.placeholder(tf.float32, shape=[None, 1], name="Y")
W = tf.Variable(tf.random_normal([2, 1]), name="weight")
b = tf.Variable(tf.random_normal([1]), name="bias")
hypo = tf.sigmoid(tf.matmul(X, W) + b)
# binary cross-entropy between the labels and the sigmoid output
cost = -tf.reduce_mean(Y * tf.log(hypo) + (1 - Y) * tf.log(1 - hypo))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
train = optimizer.minimize(cost)
#### Saving model
SAVER_DIR = "model"
saver = tf.train.Saver()
checkpoint_path = os.path.join(SAVER_DIR, "model")
ckpt = tf.train.get_checkpoint_state(SAVER_DIR)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(4201):
    cost_val, hy_val, _ = sess.run([cost, hypo, train], feed_dict={X: x_data, Y: y_data})
# step is 4200 here, so the checkpoint is written as model-4200
saver.save(sess, checkpoint_path, global_step=step)
And restore model:
saver = tf.train.import_meta_graph('./model/model-4200.meta')
saver.restore(sess, './model/model-4200')
result = sess.run(hypo, feed_dict={X: x_data_test})
fig, ax = plt.subplots()
# Correct answers: every item is either 0 or 1.
ax.plot(Julian_test, y_data_test, 'ro-')
# Predictions: sigmoid outputs, so every item is in the range 0 to 1.
ax.plot(Julian_test, result, 'bo-')
plt.show()
As the figure shows, the sigmoid output is reversed.
But when I changed the number of steps to 5000 (the only change to the code above), the result came out correctly.
I can't understand why it makes a difference. Did I miss something? Any help is appreciated!

In simple terms, by increasing the number of steps you let the model see the data more times, which gives gradient descent more updates to fit the data.
For example:
Say you give your model 2000 steps, and at the end of those 2000 steps it has found some minimum and stops there. But what if the minimum cost found so far is not the global minimum? We can't tell, because we restricted training to 2000 steps. If you increase the number of steps to 20000, the model may find another minimum that gives more accurate results.
However, make sure that your model does not overfit, i.e. give good accuracy on your training data but not on your validation set (so do not increase the number of steps too much).
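One way to see how the step count alone can flip the result is a tiny logistic regression in plain Python. This is only an illustrative sketch: the data, the learning rate, and the deliberately unlucky initial weight w = -1.0 are assumptions, not values from the question.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(steps, lr=0.1, w=-1.0, b=0.0):
    """Gradient descent on mean cross-entropy for a 1-D logistic model."""
    xs = [-2.0, -1.0, 1.0, 2.0]   # toy inputs (illustrative)
    ys = [0.0, 0.0, 1.0, 1.0]     # label is 1 when x > 0
    for _ in range(steps):
        preds = [sigmoid(w * x + b) for x in xs]
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w_few, b_few = train(steps=5)
w_many, b_many = train(steps=200)

# predicted probability of class 1 for x = 2.0, whose true label is 1
p_few = sigmoid(w_few * 2.0 + b_few)
p_many = sigmoid(w_many * 2.0 + b_many)
print(p_few, p_many)
```

After only 5 steps the weight is still negative, so the prediction for x = 2.0 falls below 0.5 and looks reversed. After 200 steps the weight has crossed zero and the same input is classified correctly.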

Keras - preprocessing and scaling for forked architecture

I have a data set with two inputs, x1 and x2, and an output consisting of 1 binary value (0 or 1) and 45 real numbers (46 attributes in the output vector in total). I would like to use different loss functions for the binary value and the 45 real numbers, namely binary cross-entropy and mean squared error.
My knowledge of Keras is very limited, so I am not even sure this is the architecture I want. Is this the right way of doing it?
First, preprocessing:
# load dataset
dataframe = pandas.read_csv("inputs.csv", delim_whitespace=True,header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:2]
Y = dataset[:,3:]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=123)
y_train_L, y_train_R = y_train[:,0], y_train[:,1:]
y_train_L = y_train_L.reshape(-1,1)
scalarX, scalarY_L, scalarY_R = MinMaxScaler(), MinMaxScaler(), MinMaxScaler()
scalarX.fit(x_train)
scalarY_L.fit(y_train_L)
scalarY_R.fit(y_train_R)
x_train = scalarX.transform(x_train)
y_train_L = scalarY_L.transform(y_train_L)
y_train_R = scalarY_R.transform(y_train_R)
where y_train_L (the left part) holds just the binary values and y_train_R holds the real numbers. I had to split them because of how the architecture is defined:
# define and fit the final model
inputs = Input(shape=(x_train.shape[1],))
first =Dense(46, activation='relu')(inputs)
#last
layer45 = Dense(45, activation='linear')(first)
layer1 = Dense(1, activation='tanh')(first)
out = [layer1,layer45]
#end last
model = Model(inputs=inputs,outputs=out)
model.compile(loss=['binary_crossentropy','mean_squared_error'], optimizer='adam')
model.fit(x_train, [y_train_L,y_train_R], epochs=1000, verbose=1)
Xnew = scalarX.transform(x_test)
y_test_L, y_test_R = y_test[:,0], y_test[:,1:]
y_test_L = y_test_L.reshape(-1,1)
y_test_L=scalarY_L.transform(y_test_L)
y_test_R=scalarY_R.transform(y_test_R)
# make a prediction
ynew = model.predict(Xnew)
loss=['binary_crossentropy','mean_squared_error'] expects two different arrays in model.fit(x_train, [y_train_L,y_train_R]).
Then I have to do all the 'funny' tricks to get the predicted values and compare them side by side, because ynew = model.predict(Xnew) returns a list of two arrays, one for the binary values and one for the real numbers.
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("SCALED VALUES")
for i in range(len(Xnew)):
    print("X=%s\n P=%s,%s\n A=%s,%s" % (Xnew[i], ynew[0][i], ynew[1][i], y_test_L[i], y_test_R[i]))
inversed_X_test = scalarX.inverse_transform(Xnew)
inversed_Y_test_L = scalarY_L.inverse_transform(y_test_L)
inversed_Y_test_R = scalarY_R.inverse_transform(y_test_R)
inversed_y_predicted_L = scalarY_L.inverse_transform(ynew[0])
inversed_y_predicted_R = scalarY_R.inverse_transform(ynew[1])
print("REAL VALUES")
for i in range(len(inversed_X_test)):
    print("X=%s\n P=%s,%s\n A=%s,%s" % (inversed_X_test[i], inversed_y_predicted_L[i], inversed_y_predicted_R[i], inversed_Y_test_L[i], inversed_Y_test_R[i]))
Questions:
Can I achieve this in a cleaner way?
How can I measure the loss? I would like to create a chart of the loss values during training.

1) The way you define your model seems correct and there is no 'cleaner' way of doing it (I would argue that Keras' functional API is as clean as it gets)
2) To visualize training loss, store the history of training in a variable:
history = model.fit(...)
This history object will contain the training loss (and the validation loss, if validation data is provided) for each epoch; you can use it to make plots.
3) In your classification output (layer1), you want to use a sigmoid activation instead of tanh. The sigmoid function returns values between 0 and 1, tanh returns values between -1 and 1. Your binary_crossentropy loss function expects the former.
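The range mismatch in point 3 can be illustrated with a small plain-Python sketch. The binary_crossentropy helper below is a hand-written version of the textbook formula, not the Keras implementation, and the input value is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Cross-entropy for one example; p must lie strictly between 0 and 1."""
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

z = -3.0                      # an arbitrary pre-activation value
p_sigmoid = sigmoid(z)        # always in (0, 1)
p_tanh = math.tanh(z)         # in (-1, 1), negative here

print(binary_crossentropy(0.0, p_sigmoid))

try:
    binary_crossentropy(0.0, p_tanh)
    tanh_ok = True
except ValueError:
    # math.log rejects the negative "probability" produced by tanh
    tanh_ok = False
print(tanh_ok)
```

The formula is only defined for probabilities, which is why the sigmoid output works and the negative tanh output does not. A framework may handle out-of-range values differently from this plain version, but the loss is still not meaningful for them.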

Add Custom Regularization to Tensorflow

I am using TensorFlow to optimize a simple least squares objective of the form $\min_w \|Y - Xw\|_2^2$ (the sum-of-squared-errors cost that costfunc computes in the code below).
Here, Y is the target vector, X is the input matrix, and the vector w holds the weights to be learned.
Example scenario: [the formulas for this example are missing from the source]
If I wanted to augment the initial objective function to impose an additional constraint on w1 (the first scalar value in the tensorflow Variable w and X1 represents the first column of the feature matrix X), how would I achieve this in tensorflow?
One solution I can think of is to use tf.slice to index the first value of $w$ and add this in addition to the original cost term but I am not convinced that it will have the desired effect on the weights.
I would appreciate inputs on whether something like this is possible in tensorflow and if so, what the best ways to implement this might be?
An alternate option would be to add weight constraints, and do it using an augmented Lagrangian objective but I would first like to explore the regularization option before going the Lagrangian route.
The current code I have for the initial objective function without additional regularization is the following:
train_x ,train_y are the training data, training targets respectively.
test_x , test_y are the testing data, testing targets respectively.
import numpy as np
import tensorflow as tf
# Sum of squared errors cost.
def costfunc(predicted, actual):
    return tf.reduce_sum(tf.square(predicted - actual))
# Mean squared error on the given test data.
def prediction(sess, X, y_, test_x, test_y):
    pred_y = sess.run(y_, feed_dict={X: test_x})
    mymse = tf.reduce_mean(tf.square(pred_y - test_y))
    mseval = sess.run(mymse)
    return mseval, pred_y
with tf.Session() as sess:
    X = tf.placeholder(tf.float32, [None, num_feat])  # Training data
    Y = tf.placeholder(tf.float32, [None, 1])  # Target values
    W = tf.Variable(tf.ones([num_feat, 1]), name="weights")
    init = tf.global_variables_initializer()
    sess.run(init)
    # TensorFlow ops and cost function definitions.
    y_ = tf.matmul(X, W)
    cost_history = np.empty(shape=[1], dtype=float)
    out_of_sample_cost_history = np.empty(shape=[1], dtype=float)
    cost = costfunc(y_, Y)
    learning_rate = 0.000001
    training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    for epoch in range(training_epochs):
        sess.run(training_step, feed_dict={X: train_x, Y: train_y})
        cost_history = np.append(cost_history, sess.run(cost, feed_dict={X: train_x, Y: train_y}))
        out_of_sample_cost_history = np.append(out_of_sample_cost_history, sess.run(cost, feed_dict={X: test_x, Y: test_y}))
    # Predict on the full testing set.
    MSETest, pred_test = prediction(sess, X, y_, test_x, test_y)

tf.slice will do. During optimization, the gradient that w1 receives from the added term is summed with the gradient it receives from the original cost (gradients add up where the graph forks). It also helps to check the graph in TensorBoard.
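The effect of penalizing only the first weight can be sketched outside TensorFlow. The following plain-Python example is an illustration under assumptions: the question does not state the exact constraint, so a squared penalty lam * w[0]**2 is assumed, and the data, learning rate, and function name fit are invented.

```python
# Plain-Python gradient descent on least squares with an extra penalty on w[0].
# The squared penalty lam * w[0]**2 and the data are illustrative assumptions.

def fit(X, y, lam, lr=0.01, steps=5000):
    w = [1.0, 1.0]
    for _ in range(steps):
        residuals = [w[0] * x0 + w[1] * x1 - t for (x0, x1), t in zip(X, y)]
        # gradient of sum(residual**2) with respect to each weight
        g0 = sum(2.0 * r * x0 for r, (x0, x1) in zip(residuals, X))
        g1 = sum(2.0 * r * x1 for r, (x0, x1) in zip(residuals, X))
        # the penalty only touches w[0]; its gradient adds to the data term
        g0 += 2.0 * lam * w[0]
        w = [w[0] - lr * g0, w[1] - lr * g1]
    return w

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [2.0 * x0 + 3.0 * x1 for x0, x1 in X]   # generated with w = (2, 3)

w_plain = fit(X, y, lam=0.0)
w_reg = fit(X, y, lam=10.0)
print(w_plain, w_reg)
```

Without the penalty the fit recovers the generating weights, while with it the first weight is pulled toward zero and the second weight compensates. In TensorFlow the same idea would amount to adding the penalty on the sliced weight to cost before calling minimize.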

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10^-1 to 10^-4 which was reduced in log scale at each epoch.
This paper uses the wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). The Xavier method suggests initializing the variables from a normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters))
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain in detail how to construct the loss function. I am unable to find an equivalent TensorFlow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to stochastic gradient descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
regularizers = 0.0  # accumulate the L2 penalty over all conv weights
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)

Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable, or switch to slim (it does the initialization for you automatically).
Q2: Is it possible to set the learning rate in log scale?
You can, but do you need it? It is not the first thing you need to solve in this network; see Q3 below.
For reference, though, you can use the following:
global_step = tf.train.get_or_create_global_step()  # exponential_decay needs a step counter
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, global_step=global_step, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss, global_step=global_step)
Q3: How to create the loss function described above?
First, your post does not show the conversion from "pred" to "image" (based on the paper, you need to apply the subtraction and the IDWT to obtain the final image).
There is one problem here: the logits have to be calculated based on your label data. That is, if you use the labeled data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This gives you the output to compare against "Y : Label".
If your dataset's labeled images are the denoised ones, follow this instead:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network; you use them as the result from which the rest is calculated. Instead of matmul, you can add a conv2d layer here, without batch normalization or an activation function, and set the output feature count to 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function gives you basic training capability. However, it is an L1 distance, and it may suffer from some issues. Consider the following situation.
Say your output is the array [10, 10, 10, 0, 0] and you are trying to reach [10, 10, 10, 10, 10]. In this case your loss is 20 (10 + 10), yet you have 3/5 success. It may also indicate some overfitting.
For the same target, consider the output [6, 6, 6, 6, 6]. It also has a loss of 20 (4 + 4 + 4 + 4 + 4). However, if you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case we want.
If you use an L2 loss, the first case gives 10^2 + 10^2 = 200 as the loss, while the second case gives 4^2 * 5 = 80.
Hence, the optimizer will try to move away from the first case as quickly as possible, favoring overall success over perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that:
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can look at the cross-entropy loss function (it applies softmax internally, so do not apply softmax twice):
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=image, logits=pred))
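The L1 versus L2 arithmetic from this answer can be reproduced with a few lines of plain Python. The function names l1_loss and l2_loss are defined here for illustration and are not TensorFlow calls.

```python
def l1_loss(output, target):
    """Sum of absolute differences."""
    return sum(abs(o - t) for o, t in zip(output, target))

def l2_loss(output, target):
    """Sum of squared differences."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

target = [10, 10, 10, 10, 10]
case_1 = [10, 10, 10, 0, 0]   # perfect on three outputs, complete failure on two
case_2 = [6, 6, 6, 6, 6]      # moderately off on every output

print(l1_loss(case_1, target), l1_loss(case_2, target))  # 20 20
print(l2_loss(case_1, target), l2_loss(case_2, target))  # 200 80
```

The L1 loss cannot tell the two cases apart, while the L2 loss penalizes the first case more heavily. Note that tf.nn.l2_loss additionally divides the sum of squares by 2, which halves both values without changing their order.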

Q1. How should I initialize the variables? Is the code below correct?
That's correct (although it is missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.
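For Q2, a per-epoch schedule that is evenly spaced on a log scale can be computed up front. This is a minimal sketch in plain Python; the function name log_scale_schedule and the choice of 4 epochs are assumptions made for illustration, since the paper's epoch count is not given here.

```python
def log_scale_schedule(lr_start, lr_end, num_epochs):
    """Learning rates spaced evenly in log10 between lr_start and lr_end."""
    if num_epochs == 1:
        return [lr_start]
    # constant ratio between consecutive epochs
    ratio = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * ratio ** epoch for epoch in range(num_epochs)]

rates = log_scale_schedule(1e-1, 1e-4, num_epochs=4)
print(rates)
```

With 4 epochs the rates step through roughly 1e-1, 1e-2, 1e-3, and 1e-4. Such a list could then be supplied as the values of tf.train.piecewise_constant with boundaries at the epoch ends, or fed through a placeholder once per epoch.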

Other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate on a log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D, 1)
y = np.dot(x, w)
print(y.shape)
# modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size, 1])
W = tf.get_variable("w", shape=[D, 1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)  # the learning rate is fed in at every step
pred = tf.add(tf.matmul(X, W), B)
print(pred.shape)
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
    for i in range(100000):
        if i > 0 and i % N == 0:
            # epoch done, halve the learning rate
            learning_rate /= 2
            print("Epoch completed. LR =", learning_rate)
        # integer division keeps idx usable as a slice index
        idx = i // batch_size + i % batch_size
        f = {X: x[idx:idx + batch_size, :], Y: y[idx:idx + batch_size, :], lr: learning_rate}
        _, err = sess.run([train_op, mse], feed_dict=f)
        acc_err += err
        if i % 5000 == 0:
            print("Average error = {}".format(acc_err / 5000))
            acc_err = 0.0
    do_train = False  # stop after one pass so the script terminates
