Applying custom learning rates to variables in Tensorflow

Applying custom learning rates to variables in Tensorflow - python

In Tensorflow, after I obtain my loss term, I give it to an optimizer and it adds the necessary differentiation and update terms to the computation graph:
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Decay rate.
staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
I want a small modification to this process. I want to scale the learning_rate differently for my every distinct parameter in the network. For example, let A and B two different trainable parameters in the network and let dL/dA and dL/dB the partial derivatives of the parameters with respect to the loss. The momentum optimizer updates the variables as:
Ma <- 0.9*Ma + learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + learning_rate*dL/dB
B <- B - Mb
I want to modify this as:
Ma <- 0.9*Ma + ca*learning_rate*dL/dA
A <- A - Ma
Mb <- 0.9*Mb + cb*learning_rate*dL/dB
B <- B - Mb
Where ca and cb are special learning rate scales for different parameters. As far as I understand, Tensorflow has compute_gradients and apply_gradients methods we can call for such cases, but the documentation is not very clear about how to use them. Any help would be much appreciated.

TO calculate gradient:
self.gradients = tf.gradients(self.loss, tf.trainable_variables())
Now, you access the gradients using sess.run([model.gradients], feed_dict)
Assuming, you have declared the learning_rate as a tf.Variable(), you can assign the learning rate using the following code:
sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** epoch)))
The above code is just an example. You can modify it to be used for your purpose.

Custom learning rate, in tensorflow
are very easy to handle.
learning_rate = tf.Variable(INITIAL_LR,trainable=False,name="lr")
and say l1 and l2 are two different learning rates :
l1 = ca * learning_rate
l2 = cb * learning_rate
you can do any type of mathematical manipulation with respect to learning rate, and apply it in this manner :
optimizer=tf.train.MomentumOptimizer(l1,0.9).minimize(network.finalLoss, global_step=global_counter)
Regarding your problem: what you want is actually different gradient for different layers, say L1 layer (trainable variables containing Ma) and L2
(trainable variables containing Mb)
global_counter = tf.Variable(0, dtype=DATA_TYPE, trainable=False)
learning_rate = tf.train.exponential_decay(
INITIAL_LR, # Base learning rate.
global_counter, # Current index into the dataset.
DECAY_STEP, # Decay step.
DECAY_RATE, # Dec
staircase=True)
optimizer1 = tf.train.MomentumOptimizer(ca * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L1)
optimizer2 = tf.train.MomentumOptimizer(cb * learning_rate, 0.9).minimize(network.finalLoss, global_step=global_counter , var_list= L2)
optimizer = tf.group(optimizer1 , optimizer2)
feed_dict = {TRAIN_DATA_TENSOR: samples, TRAIN_LABEL_TENSOR: labels}
results = sess.run([optimizer], feed_dict=feed_dict)
You can find the optimized version of the above code here
Please note if you can designate learning rate via tf.assign it returns the reference to the learning rate whereas the optimizer expects a float learning value type which probably will/should throw an error

Related

Why doesn't the Adadelta optimizer decay the learning rate?

I have initialised an Adadelta optimizer in Keras (using Tensorflow backend) and assigned it to a model:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95)
my_model.compile(optimizer=my_adadelta, loss="binary_crossentropy")
During training, I am using a callback to print the learning rate after every epoch:
class LRPrintCallback(Callback):
def on_epoch_end(self, epoch, logs=None):
lr = self.model.optimizer.lr
print(K.eval(lr))
However, this prints the same (initial) learning rate after every epoch.
The same thing happens if I initialize the optimizer like this:
my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, decay=0.95)
Am I doing something wrong in the initialization? Is the learning rate maybe changing but I am not printing the right thing?

As discussed in a relevant Github thread, the decay does not affect the variable lr itself, which is used only to store the initial value of the learning rate. In order to print the decayed value, you need to explicitly compute it yourself and store it in a separate variable lr_with_decay; you can do so by using the following callback:
class MyCallback(Callback):
def on_epoch_end(self, epoch, logs=None):
lr = self.model.optimizer.lr
decay = self.model.optimizer.decay
iterations = self.model.optimizer.iterations
lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
print(K.eval(lr_with_decay))
as explained here and here. In fact, the specific code snippet suggested there, i.e.
lr = self.lr
if self.initial_decay > 0:
lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
comes directly from the underlying Keras source code for Adadelta.
As clear from the inspection of the linked source code, the parameter of interest here for decaying the learning rate is decay, and not rho; despite the term 'decay' used also for describing rho in the documentation, it is a different decay not having anything to do with the learning rate:
rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.

Does gradient descent optimizer change my bias? If so, is it by the learning rate?

I'm trying to program linear regression without much external help and I've done it successfully to an extent since my MSE usually returns a small number and the outputted line of best fit looks about right. I just have a question about the last line of code below. Does the optimizer also change the bias, and if so, is it by the learning rate?
#tf graph input, the 9 training values
X = tf.placeholder("float")
Y = tf.placeholder("float")
random = random.uniform(0,20)
#weights and biases
W = tf.Variable((random), name = "Weight")
b = tf.Variable((random), name = "Bias")
#linear model multiply x by weights and biases to get a y
pred = tf.add(tf.multiply(X, W), b)
#cost function to reduce the error. MSE
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
#minimize cost taking steps of 0.01 down the parabola
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Yes, the optimizer changes the bias and the learning is done with respect to learning rate. Optimizers update all the trainable variables in the graph unless the var_list option is set (in which case they update the variables in that list).

TensorFlow - add variables (weights) constraint

Just a toy example. Suppose we have 5 stocks and we want to find the best portfolio structure (linear weights) maximizing our PnL on history. Weights are used to build portfolio invested in equities.
weights = tf.Variable(np.random.random((5, 1)), dtype=tf.double)
returns = tf.placeholder(dtype=tf.double)
portfolio = tf.matmul(returns, weights)
pnl = portfolio[-1]
optimizer = tf.train.AdamOptimizer().minimize(-1*pnl)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
train_data = {returns: returns_data}
for i in range(100):
sess.run(optimizer, feed_dict=train_data)
I want to find the best decision on history with following constraints:
each stock individual weight (min: 0.05, max: 0.5)
weights vector sum = 1 (portfolio is always invested)
How can I implement weights constraints to the optimizer?

For your first question, you can clip values into your matrice :
weights = tf.Variable(np.random.random((5, 1)), dtype=tf.double)
weights = tf.clip_by_value(weights, 0.05, 0.5)
The softmax function can be an answer to your second question (https://en.wikipedia.org/wiki/Softmax_function)
By the way, optimizer = tf.train.AdamOptimizer().minimize(-1*pnl) won't work. The optimizer needs to know the variable list from which it needs to minimize the loss. Therefore it is : optimizer = tf.train.AdamOptimizer().minimize(-1*pnl, weights)

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)

Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))

Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.

Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0

TensorFlow: adding regularization to LSTM

Following Tensorflow LSTM Regularization I am trying to add regularization term to the cost function when training parameters of LSTM cells.
Putting aside some constants I have:
def RegularizationCost(trainable_variables):
cost = 0
for v in trainable_variables:
cost += r(tf.reduce_sum(tf.pow(r(v.name),2)))
return cost
...
regularization_cost = tf.placeholder(tf.float32, shape = ())
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
...
tv = tf.trainable_variables()
s = tf.Session()
r = s.run
...
while (...):
...
reg_cost = RegularizationCost(tv)
r(optimizer, feed_dict = {x: x_b, y: y_b, regularization_cost: reg_cost})
The problem I have is that adding the regularization term hugely slows the learning process and actually the regularization term reg_cost is increasing with each iteration visibly when the term associated with pred - y pretty much stagnated i.e. the reg_cost seems not to be taken into account.
As I suspect I am adding this term in completely wrong way. I did not know how to add this term in the cost function itself so I used a workaround with scalar tf.placeholder and "manually" calculated the regularization cost. How to do it properly?

compute the L2 loss only once:
tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv ])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
you might want to remove the variables that are bias as those should not be regularized.

It slows down because your code creates new nodes in every iteration. This is not how you code with TF. First, you create your whole graph, including regularization terms, then, in the while loop you only execute them, each "tf.XXX" operation creates new nodes.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Applying custom learning rates to variables in Tensorflow - python

Related

Why doesn't the Adadelta optimizer decay the learning rate?

Does gradient descent optimizer change my bias? If so, is it by the learning rate?

TensorFlow - add variables (weights) constraint

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

TensorFlow: adding regularization to LSTM

Categories

Resources