Add Custom Regularization to Tensorflow

Add Custom Regularization to Tensorflow - python

I am using tensorflow to optimize a simple least squares objective function like the following:
Here, Y is the target vector ,X is the input matrix and vector w represents the weights to be learned.
Example Scenario:
, ,
If I wanted to augment the initial objective function to impose an additional constraint on w1 (the first scalar value in the tensorflow Variable w and X1 represents the first column of the feature matrix X), how would I achieve this in tensorflow?
One solution I can think of is to use tf.slice to index the first value of $w$ and add this in addition to the original cost term but I am not convinced that it will have the desired effect on the weights.
I would appreciate inputs on whether something like this is possible in tensorflow and if so, what the best ways to implement this might be?
An alternate option would be to add weight constraints, and do it using an augmented Lagrangian objective but I would first like to explore the regularization option before going the Lagrangian route.
The current code I have for the initial objective function without additional regularization is the following:
train_x ,train_y are the training data, training targets respectively.
test_x , test_y are the testing data, testing targets respectively.
#Sum of Squared Errs. Cost.
def costfunc(predicted,actual):
return tf.reduce_sum(tf.square(predicted - actual))
#Mean Squared Error Calc.
def prediction(sess,X,y_,test_x,test_y):
pred_y = sess.run(y_,feed_dict={X:test_x})
mymse = tf.reduce_mean(tf.square(pred_y - test_y))
mseval=sess.run(mymse)
return mseval,pred_y
with tf.Session() as sess:
X = tf.placeholder(tf.float32,[None,num_feat]) #Training Data
Y = tf.placeholder(tf.float32,[None,1]) # Target Values
W = tf.Variable(tf.ones([num_feat,1]),name="weights")
init = tf.global_variables_initializer()
sess.run(init)
#Tensorflow ops and cost function definitions.
y_ = tf.matmul(X,W)
cost_history = np.empty(shape=[1],dtype=float)
out_of_sample_cost_history = np.empty(shape=[1],dtype=float)
cost=costfunc(y_,Y)
learning_rate = 0.000001
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
for epoch in range(training_epochs):
sess.run(training_step,feed_dict={X:train_x,Y:train_y})
cost_history = np.append(cost_history,sess.run(cost,feed_dict={X: train_x,Y: train_y}))
out_of_sample_cost_history = np.append(out_of_sample_cost_history,sess.run(cost,feed_dict={X:test_x,Y:test_y}))
MSETest,pred_test = prediction(sess,X,y_,test_x,test_y) #Predict on full testing set.

tf.slice will do. And during optimization, the gradients to w1 will be added (because gradients add up at forks). Also, please check the graph on Tensorboard (the link on how to use it).

Related

Does gradient descent optimizer change my bias? If so, is it by the learning rate?

I'm trying to program linear regression without much external help and I've done it successfully to an extent since my MSE usually returns a small number and the outputted line of best fit looks about right. I just have a question about the last line of code below. Does the optimizer also change the bias, and if so, is it by the learning rate?
#tf graph input, the 9 training values
X = tf.placeholder("float")
Y = tf.placeholder("float")
random = random.uniform(0,20)
#weights and biases
W = tf.Variable((random), name = "Weight")
b = tf.Variable((random), name = "Bias")
#linear model multiply x by weights and biases to get a y
pred = tf.add(tf.multiply(X, W), b)
#cost function to reduce the error. MSE
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
#minimize cost taking steps of 0.01 down the parabola
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Yes, the optimizer changes the bias and the learning is done with respect to learning rate. Optimizers update all the trainable variables in the graph unless the var_list option is set (in which case they update the variables in that list).

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)

Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))

Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.

Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0

In tensorflow what is the difference between trainable and stop gradient

I would like to know the difference between the option trainable=False and the tf.stop_gradient(). If I make the trainable option False will my optimizer not consider the variable for training?
Does this option make the it a constant value throughout the training?

trainable=False
Here the variable value will be constant throughout the training. Optimizer won't consider this variable for training, no gradient update op.
stop_gradient
In certain situations, you want to calculate the gradient of a op with respect to some variable keeping a few other variables constant; but for other ops you may use those variables also to calculate gradient. So here you can't use trinable=False, as you need those variable for training with other ops.
stop_gradient is very useful for ops; you can selectively optimize a op with respect to select few variables while keeping other constant.
y1 = tf.stop_gradient(W1x+b1)
y2 = W2y1+b2
cost = cost_function(y2, y)
# this following op wont optimize the cost with respect to W1 and b1
train_op_w2_b2 = tf.train.MomentumOptimizer(0.001, 0.9).minimize(cost)
W1 = tf.get_variable('w1', trainable=False)
y1 = W1x+b1
y2 = W2y1+b2
cost = cost_function(y2, y)
# this following op wont optimize the cost with respect to W1
train_op = tf.train.MomentumOptimizer(0.001, 0.9).minimize(cost)

Embedding vectors not being updated when using Tensorflow on window classification

I am trying to implement a window based classifier with tensorflow,
The word embedding matrix is called word_vec and is initialized randomly (I tried Xavier also).
And the ind variable is the a vector of the indices of the word vectors from the matrix.
The first layer is config['window_size'] (5) word vectors concatenated.
word_vecs = tf.Variable(tf.random_uniform([len(words), config['embed_size']], -1.0, 1.0),dtype=tf.float32)
ind = tf.placeholder(tf.int32, [None, config['window_size']])
x = tf.concat(1,tf.unpack(tf.nn.embedding_lookup(word_vecs, ind),axis=1))
W0 = tf.Variable(tf.random_uniform([config['window_size']*config['embed_size'], config['hidden_layer']]))
b0 = tf.Variable(tf.zeros([config['hidden_layer']]))
W1 = tf.Variable(tf.random_uniform([config['hidden_layer'], out_layer]))
b1 = tf.Variable(tf.zeros([out_layer]))
y0 = tf.nn.tanh(tf.matmul(x, W0) + b0)
y1 = tf.nn.softmax(tf.matmul(y0, W1) + b1)
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y1), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(0.5).minimize(cross_entropy)
And this is how I run the graph:
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(config['iterations'] ):
r = random.randint(0,len(sentences)-1)
inds=generate_windows([w for w,t in sentences[r]])
#inds now contains an array of n rows on window_size columns
ys=[one_hot(tags.index(t),len(tags)) for w,t in sentences[r]]
#ys now contains an array of n rows on output_size columns
sess.run(train_step, feed_dict={ind: inds, y_: ys})
The dimensions work out, and the code runs
However, the accuracy is near zero, and I suspect that the the word vectors aren't being updated properly.
How can I make tensorflow update the word vectors back from the concatenated window form ?

Your embeddings are initialised using tf.Variable which are by default trainable. They will be updated. The problem might be with the way you are calculating loss. Look at these following lines
y1 = tf.nn.softmax(tf.matmul(y0, W1) + b1)
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y1), reduction_indices=[1]))
Here you are calculating the softmax function which converts the scores into probabilities
If the denominator here becomes too large or too small then this function can go for a toss. To avoid this numerical instability usually a small epsilon is added like below. This makes sure that there is numerical stability.
You can see that even after adding an epsilon the softmax functions value remains the same. If you don't handle this on your own then the gradients may not update properly due to vanishing or exploding gradients.
Avoid the three lines of code and use the tensorflow version
tf.nn.sparse_softmax_cross_entropy_with_logits
Note that this function will calculate the softmax function internally.
It is advisable to use this instead of calculating the loss manually. You can use this as follows
y1 = tf.matmul(y0, W1) + b1
y_ = tf.placeholder(tf.float32, [None, out_layer])
cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y1, labels=y_))

You need to initialize your W matrices to a random value.
Right now y1 is always 0 due to zero initialization.

Your starting of algorithm is fine. But I have some confidence this approach don't work. Actually word to vector trick is became working after estimation approximations are found applicable for NLP . For example techniques called Importance Sampling and Noise-Contrastive Estimation.
So why straight approach doesn't work? I think, that to solve the task model must precisely find right 1 answer from large vocabulary, say 80000 words. 1 from 80000 - is too hard to optimize model, gradients didn't tell anything for most cases.
Update:
I forgot to mention that main reason of estimation approximation is performance issues of straight approach were you have large output. Each iteration steps for all examples require calculating loss for each output unit (like 80000). Optimization will take long time to be intractable.
How to implement right word2vec using sampling and NCE loss? Easily, following tutorial here, loss function looks like this:
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
Main idea is we need only few m negative samples and 1 positive. Where m is far less than actual vocabulary size.
Tensorflow also has tf.nn.nce_loss
You can read more about mathematical behind of approaches in online book www.deeplearningbook.org (I. Goodfellow et al)

How to apply gradient clipping in TensorFlow?

Considering the example code.
I would like to know How to apply gradient clipping on this network on the RNN where there is a possibility of exploding gradients.
tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)
This is an example that could be used but where do I introduce this ?
In the def of RNN
lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
# Split data because rnn cell needs a list of inputs for the RNN inner loop
_X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)
But this doesn't make sense as the tensor _X is the input and not the grad what is to be clipped?
Do I have to define my own Optimizer for this or is there a simpler option?

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section in TensorFlow's API documentation. Specifically you'll need to substitute the call to the minimize() method with something like the following:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
Clipping each gradient matrix individually changes their relative scale but is also possible:
optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
None if gradient is None else tf.clip_by_norm(gradient, 5.0)
for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:
optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

It's easy for tf.keras!
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)
This optimizer will clip all gradients to values between [-1.0, 1.0].
See the docs.

This is actually properly explained in the documentation.:
Calling minimize() takes care of both computing the gradients and
applying them to the variables. If you want to process the gradients
before applying them you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients().
Process the gradients as you wish.
Apply the processed gradients with apply_gradients().
And in the example they provide they use these 3 steps:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.

For those who would like to understand the idea of gradient clipping (by norm):
Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.
Let the gradient be g and the max_norm_threshold be j.
Now, if ||g|| > j , we do:
g = ( j * g ) / ||g||
This is the implementation done in tf.clip_by_norm

IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:
original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)
This way you only have to define this once, and not run it after every gradients calculation.
Documentation:
https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm

Gradient Clipping basically helps in case of exploding or vanishing gradients.Say your loss is too high which will result in exponential gradients to flow through the network which may result in Nan values . To overcome this we clip gradients within a specific range (-1 to 1 or any range as per condition) .
clipped_value=tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars
where grads _and_vars are the pairs of gradients (which you calculate via tf.compute_gradients) and their variables they will be applied to.
After clipping we simply apply its value using an optimizer.
optimizer.apply_gradients(clipped_value)

Method 1
if you are training your model using your custom training loop then the one update step will look like
'''
for loop over full dataset
X -> training samples
y -> labels
'''
optimizer = tf.keras.optimizers.Adam()
for x, y in train_Data:
with tf.GradientTape() as tape:
prob = model(x, training=True)
# calculate loss
train_loss_value = loss_fn(y, prob)
# get gradients
gradients = tape.gradient(train_loss_value, model.trainable_weights)
# clip gradients if you want to clip by norm
gradients = [(tf.clip_by_norm(grad, clip_norm=1.0)) for grad in gradients]
# clip gradients via values
gradients = [(tf.clip_by_value(grad, clip_value_min=-1.0, clip_value_max=1.0)) for grad in gradients]
# apply gradients
optimizer.apply_gradients(zip(gradients, model.trainable_weights))
Method 2
Or you could also simply just replace the first line in above code as below
# for clipping by norm
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# for clipping by value
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)
second method will also work if you are using model.compile -> model.fit pipeline.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Add Custom Regularization to Tensorflow - python

tf.slice will do. And during optimization, the gradients to w1 will be added (because gradients add up at forks). Also, please check the graph on Tensorboard (the link on how to use it).

Related

Does gradient descent optimizer change my bias? If so, is it by the learning rate?

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

In tensorflow what is the difference between trainable and stop gradient

Embedding vectors not being updated when using Tensorflow on window classification

How to apply gradient clipping in TensorFlow?

Categories

Resources