How to do Xavier initialization on TensorFlow

I'm porting my Caffe network over to TensorFlow but it doesn't seem to have xavier initialization. I'm using truncated_normal but this seems to be making it a lot harder to train.

Since version 0.8 there is a built-in Xavier initializer, tf.contrib.layers.xavier_initializer (see its docs).
You can use something like this:
W = tf.get_variable("W", shape=[784, 256],
                    initializer=tf.contrib.layers.xavier_initializer())

Just to add another example of how to define a tf.Variable initialized using Xavier Glorot and Yoshua Bengio's method:
graph = tf.Graph()
with graph.as_default():
    ...
    initializer = tf.contrib.layers.xavier_initializer()
    w1 = tf.Variable(initializer(w1_shape))
    b1 = tf.Variable(initializer(b1_shape))
    ...
This prevented NaN values in my loss function caused by numerical instabilities when using multiple layers with ReLUs.

In TensorFlow 2.0 and later, tf.contrib.* has been removed and tf.get_variable() is only available as tf.compat.v1.get_variable(). To do Xavier initialization you now have to switch to:
init = tf.initializers.GlorotUniform()
var = tf.Variable(init(shape=shape))
# or a one-liner with slightly confusing parentheses
var = tf.Variable(tf.initializers.GlorotUniform()(shape=shape))
Glorot uniform and Xavier uniform are two different names for the same initialization type. If you want to know more about how to use initializers in TF 2.0, with or without Keras, refer to the documentation.

@Aleph7, Xavier/Glorot initialization depends on the number of incoming connections (fan_in), the number of outgoing connections (fan_out), and the kind of activation function (sigmoid or tanh) of the neuron. See this: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
So now, to your question. This is how I would do it in TensorFlow:
(fan_in, fan_out) = ...
low = -4*np.sqrt(6.0/(fan_in + fan_out)) # use 4 for sigmoid, 1 for tanh activation
high = 4*np.sqrt(6.0/(fan_in + fan_out))
return tf.Variable(tf.random_uniform(shape, minval=low, maxval=high, dtype=tf.float32))
Note that we should be sampling from a uniform distribution, and not the normal distribution as suggested in the other answer.
Incidentally, I wrote a post yesterday for something different using TensorFlow that happens to also use Xavier initialization. If you're interested, there's also a python notebook with an end-to-end example: https://github.com/delip/blog-stuff/blob/master/tensorflow_ufp.ipynb
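To make the formula above concrete, here is a minimal NumPy sketch of the uniform variant. The function name glorot_uniform, the gain argument (1 for tanh, 4 for sigmoid, following the comment in the snippet above) and the 784x256 shape are illustrative assumptions, not part of any TensorFlow API:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, gain=1.0, rng=None):
    """Sample a (fan_in, fan_out) matrix from U(-limit, limit)."""
    rng = np.random.default_rng() if rng is None else rng
    # limit = gain * sqrt(6 / (fan_in + fan_out)), as in the answer above
    limit = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Hypothetical layer shape, chosen only for the demonstration
W = glorot_uniform(784, 256, rng=np.random.default_rng(0))
limit = np.sqrt(6.0 / (784 + 256))

# All samples fall inside [-limit, limit)
print(W.min() >= -limit, W.max() <= limit)
# The variance of U(-a, a) is a**2 / 3, i.e. 2 / (fan_in + fan_out) here
print(W.var(), 2.0 / (784 + 256))
```

With gain=1 the sample variance should land close to 2 / (fan_in + fan_out), which is the quantity the Glorot and Bengio paper aims for.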

A nice wrapper around tensorflow called prettytensor gives an implementation in the source code (copied directly from here):
def xavier_init(n_inputs, n_outputs, uniform=True):
    """Set the parameter initialization using the method described.
    This method is designed to keep the scale of the gradients roughly the same
    in all layers.
    Xavier Glorot and Yoshua Bengio (2010):
             Understanding the difficulty of training deep feedforward neural
             networks. International conference on artificial intelligence and
             statistics.
    Args:
      n_inputs: The number of input nodes into each output.
      n_outputs: The number of output nodes for each input.
      uniform: If true use a uniform distribution, otherwise use a normal.
    Returns:
      An initializer.
    """
    if uniform:
        # 6 was used in the paper.
        init_range = math.sqrt(6.0 / (n_inputs + n_outputs))
        return tf.random_uniform_initializer(-init_range, init_range)
    else:
        # 3 gives us approximately the same limits as above since this repicks
        # values greater than 2 standard deviations from the mean.
        stddev = math.sqrt(3.0 / (n_inputs + n_outputs))
        return tf.truncated_normal_initializer(stddev=stddev)

TF-contrib has xavier_initializer. Here is an example of how to use it:
import tensorflow as tf
a = tf.get_variable("a", shape=[4, 4], initializer=tf.contrib.layers.xavier_initializer())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(a))
In addition to this, tensorflow has other initializers:
xavier_initializer_conv2d
variance_scaling_initializer
constant_initializer
zeros_initializer
ones_initializer
uniform_unit_scaling_initializer
truncated_normal_initializer
random_uniform_initializer
random_normal_initializer
orthogonal_initializer
as well as a lot of initializers from keras

I looked and I couldn't find anything built in. However, according to this:
http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
Xavier initialization is just sampling a (usually Gaussian) distribution where the variance is a function of the number of neurons. tf.random_normal can do that for you; you just need to compute the stddev from the number of neurons connected by the weight matrix you're trying to initialize.
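As a rough illustration of the Gaussian variant, the sketch below uses the variance 2 / (fan_in + fan_out) from the Glorot and Bengio paper. The helper name glorot_normal and the layer shape are assumptions for the example; in TensorFlow 1.x the computed stddev would be passed to tf.random_normal instead:

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) matrix from N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    stddev = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

# Hypothetical layer shape, chosen only for the demonstration
W_normal = glorot_normal(784, 256, rng=np.random.default_rng(0))
target_std = np.sqrt(2.0 / (784 + 256))

# Sample standard deviation next to the value the formula asks for
print(W_normal.std(), target_std)
```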

You can pass it via the kernel_initializer parameter of tf.layers.conv2d, tf.layers.conv2d_transpose, tf.layers.Dense, etc.
e.g.
layer = tf.layers.conv2d(
    input, 128, 5, strides=2, padding='SAME',
    kernel_initializer=tf.contrib.layers.xavier_initializer())
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d_transpose
https://www.tensorflow.org/api_docs/python/tf/layers/Dense

Just in case you want to use one line as you do with:
W = tf.Variable(tf.truncated_normal((n_prev, n), stddev=0.1))
You can do:
W = tf.Variable(tf.contrib.layers.xavier_initializer()((n_prev, n)))

TensorFlow 1:
W1 = tf.get_variable("W1", [25, 12288],
                     initializer=tf.contrib.layers.xavier_initializer(seed=1))
TensorFlow 2:
W1 = tf.Variable(tf.initializers.GlorotUniform(seed=1)(shape=[25, 12288]), name="W1")

Related

Tensorflow 2.0 doesn't compute the gradient

I want to visualize the patterns that a given feature map in a CNN has learned (in this example I'm using vgg16). To do so I create a random image, feed it through the network up to the desired convolutional layer, choose the feature map and find the gradients with respect to the input. The idea is to change the input in a way that maximizes the activation of the desired feature map. Using TensorFlow 2.0, I have a GradientTape that follows the function and then computes the gradient. However, the gradient returns None. Why is it unable to compute the gradient?
import tensorflow as tf
import matplotlib.pyplot as plt
import time
import numpy as np
from tensorflow.keras.applications import vgg16
class maxFeatureMap():
    def __init__(self, model):
        self.model = model
        self.optimizer = tf.keras.optimizers.Adam()
    def getNumLayers(self, layer_name):
        for layer in self.model.layers:
            if layer.name == layer_name:
                weights = layer.get_weights()
                num = weights[1].shape[0]
                return ("There are {} feature maps in {}".format(num, layer_name))
    def getGradient(self, layer, feature_map):
        pic = vgg16.preprocess_input(np.random.uniform(size=(1,96,96,3))) ## Creates values between 0 and 1
        pic = tf.convert_to_tensor(pic)
        model = tf.keras.Model(inputs=self.model.inputs,
                               outputs=self.model.layers[layer].output)
        with tf.GradientTape() as tape:
            ## predicts the output of the model and only chooses the feature_map indicated
            predictions = model.predict(pic, steps=1)[0][:,:,feature_map]
            loss = tf.reduce_mean(predictions)
        print(loss)
        gradients = tape.gradient(loss, pic[0])
        print(gradients)
        self.optimizer.apply_gradients(zip(gradients, pic))
model = vgg16.VGG16(weights='imagenet', include_top=False)
x = maxFeatureMap(model)
x.getGradient(1, 24)

This is a common pitfall with GradientTape; the tape only traces tensors that are set to be "watched" and by default tapes will watch only trainable variables (meaning tf.Variable objects created with trainable=True). To watch the pic tensor, you should add tape.watch(pic) as the very first line inside the tape context.
Also, I'm not sure if the indexing (pic[0]) will work, so you might want to remove that -- since pic has just one entry in the first dimension it shouldn't matter anyway.
Furthermore, you cannot use model.predict because this returns a numpy array, which basically "destroys" the computation graph chain so gradients won't be backpropagated. You should simply use the model as a callable, i.e. predictions = model(pic).
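Putting the three points together, a minimal sketch might look like the following. To keep it runnable without downloading weights, a small untrained Conv2D model stands in for VGG16; that stand-in model and the feature_map index are assumptions for illustration, not the asker's setup:

```python
import numpy as np
import tensorflow as tf

# Small stand-in for the truncated VGG16 model (hypothetical, untrained)
inputs = tf.keras.Input(shape=(96, 96, 3))
features = tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
model = tf.keras.Model(inputs, features)

feature_map = 2  # arbitrary channel index to maximize
pic = tf.convert_to_tensor(
    np.random.uniform(size=(1, 96, 96, 3)).astype("float32"))

with tf.GradientTape() as tape:
    tape.watch(pic)            # pic is a plain tensor, so it must be watched explicitly
    activations = model(pic)   # call the model directly; predict() would return NumPy
    loss = tf.reduce_mean(activations[0, :, :, feature_map])

# Differentiate with respect to the watched tensor itself, not a slice of it
gradients = tape.gradient(loss, pic)
print(gradients.shape)
```

Note that optimizer.apply_gradients expects tf.Variable objects, so to update the image with an optimizer, pic would have to be created as a tf.Variable (which the tape then watches automatically).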

Did you define your own loss function? Did you convert a tensor to NumPy in your loss function?
As a beginner, I ran into the same problem:
When using tape.gradient(loss, variables), the result was None because I converted a tensor to a NumPy array in my own loss function. It seems to be a silly but common mistake for beginners.

FYI: When GradientTape is not working, the cause may be a TensorFlow issue. One troubleshooting step is to check the TensorFlow GitHub issues for known problems with the TF functions being used, for example:
Gradients do not exist for variables after tf.concat(). #37726.

Why my one-filter convolutional neural network is unable to learn a simple gaussian kernel?

I was surprised that the deep learning algorithms I had implemented did not work, and I decided to create a very simple example, to understand the functioning of CNN better. Here is my attempt of constructing a small CNN for a very simple task, which provides unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolve2d(x,gaussian_kernel,boundary='fill',mode='same')).
Thus, I would like my CNN to learn the convolutional filter which would transform the original image into its blurred version. In other words, I would like my CNN to recover the gaussian filter I used to create the blurred images. Note: As I want to 'imitate' the convolution process such as it is described in the mathematical framework, I am using a gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless, the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training = np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3,-5,nb_epochs)
tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []
# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name = 'Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])
# parameters to learn --should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',\
                                  shape = [size_image,size_image,1,1],\
                                  dtype = tf.float64,\
                                  initializer = tf.contrib.layers.xavier_initializer(seed = 0),\
                                  trainable = True)
# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides = [1,1,1,1], padding = 'SAME')
# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true,Y_blur_hat,weights=1.0)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = opt_adam.minimize(cost)
# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
    # Run the initialization
    sess.run(init)
    # Do the training loop
    for epoch in range(nb_epochs):
        minibatch_cost = 0.
        seed = seed + 1
        permutation = list(np.random.permutation(n_train))
        shuffled_ind_samples = np.array(ind_samples_training)[permutation]
        # Learning rate update
        if learning_rate.shape[0]>1:
            lr = learning_rate[epoch]
        nb_minibatches = int(np.ceil(n_train/minibatch_size))
        for num_minibatch in range(nb_minibatches):
            # Minibatch indices
            ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
            # Loading of the original image (X) and the blurred image (Y)
            minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch,size_image, normalization, sigma)
            _ , temp_cost, filter_learnt = sess.run([optimizer,cost,filter_to_learn],\
                feed_dict = {X:minibatch_X, Y_blur_true:minibatch_Y, learning_rate_placeholder: lr})
I have run the training on 5 epochs of 4900 samples, with a batch size equal to 5. The gaussian kernel has a variance of 7^2=49.
I have tried to initialize the filter to be learnt both with the Xavier initializer method provided by TensorFlow, and with the true values of the gaussian kernel we actually would like to learn. In both cases, the filter that is learnt is very different from the true gaussian one, as can be seen in the two images available at https://github.com/megalinier/Helsinki-project.

Judging by the photos, it seems the network is learning OK, as the predicted image is not so far off the true label. For better results you can tweak some hyperparameters, but that is not the point here.
I think what you are missing is the fact that different kernels can produce quite similar results, since it is a convolution.
Think about it: you are multiplying one matrix element-wise with another and then summing all the results to create a new pixel. Now if the true label sum is 10, it could be the result of 2.5 + 2.5 + 2.5 + 2.5 or of -10 + 10 + 10 + 0.
What I am trying to say is that your network could be learning just fine, but you will get different values in the conv kernel than in the original filter.

I think this would better serve as a comment as it's somewhat speculative, but it's too long...
It is hard to say what exactly is wrong, but there could be multiple culprits here. For one, squared error provides a weak signal when target and prediction are already quite similar, and while the Xavier-initialized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (e.g. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is: The random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0. As such, the learning algorithm doesn't have much incentive to actually pull them to 0. By adding a weight penalty, you provide this incentive.
Third, it could just be Adam messing up. It is known to provide "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default Gradient Descent with learning rate decay (and possibly momentum).

keras implementation of Levenberg-Marquardt optimization algorithm as a custom optimizer

I am trying to implement the Levenberg-Marquardt algorithm as a Keras optimizer, as was described here, but I have several problems. The biggest one is this error:
TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn.
After a quick search I found out that this is connected to how TensorFlow runs programs as graphs, which I don't understand in detail. I found this answer from SO useful, but it is about a loss function, not an optimizer.
So, to the point.
My attempt looks like this:
from keras.optimizers import Optimizer
from keras.legacy import interfaces
from keras import backend as K
class Leveberg_Marquardt(Optimizer):
    def __init__(self, tau =1e-2 , lambda_1=1e-5, lambda_2=1e+2, **kwargs):
        super(Leveberg_Marquardt, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.tau = K.variable(tau,name ='tau')
            self.lambda_1 = K.variable(lambda_1,name='lambda_1')
            self.lambda_2 = K.variable(lambda_2,name='lambda_2')
    @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss,params)
        self.updates = [K.update_add(self.iterations,1)]
        error = [K.int_shape(m) for m in loss]
        for p,g,err in zip(params,grads,error):
            H = K.dot(g, K.transpose(g)) + self.tau * K.eye(K.max(g))
            w = p - K.pow(H,-1) * K.dot(K.transpose(g),err) #ended at step 3 from http://mads.lanl.gov/presentations/Leif_LM_presentation_m.pdf
            if self.tau > self.lambda_2:
                w = w - 1/self.tau * err
            if self.tau < self.lambda_1:
                w = w - K.pow(H,-1) * err
            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                w = p.constraint(w)
            self.updates.append(K.update_add(err, w))
        return self.updates
    def get_config(self):
        config = {'tau':float(K.get_value(self.tau)),
                  'lambda_1':float(K.get_value(self.lambda_1)),
                  'lambda_2':float(K.get_value(self.lambda_2)),}
        base_config = super(Leveberg_Marquardt, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
Q1 Can I fix this error without going deep into tensorflow (I wish I could do this by staying on Keras level)
Q2 Do I use keras backend in correct way?
I mean, in this line
H = K.dot(g, K.transpose(g)) + self.tau * K.eye(K.max(g))
should I use Keras backend functions, or NumPy, or pure Python, so that this code runs without problems when the input data are NumPy arrays?
Q3 This question is more about the algorithm itself.
Do I even implement LMA correctly? I must say, I am not sure how to deal with boundary conditions. I have guessed the tau/lambda values; maybe you know a better way?
I was trying to understand how every other optimizer in Keras works, but even the SGD code looks ambiguous to me.
Q4 Do I need to change in any way local file optimizers.py?
In order to run it properly I was initializing my optimizer with:
myOpt = Leveberg_Marquardt()
and then simply passing it to the compile method. Yet after a quick look at the source code of optimizers.py I found that there are places in the code with explicitly written names of optimizers (e.g. the deserialize function). Is it important to extend this for my custom optimizer, or can I leave it be?
I would really appreciate any help and direction for future actions.

Q1 Can I fix this error without going deep into tensorflow (I wish I could do this by staying on Keras level)
A1 I believe that even if this error is fixed, there are still problems in implementing the algorithm that Keras does not support. For example, the error term f(x;w_0)-y from the document is not available to a Keras optimizer.
Q2 Do I use keras backend in correct way?
A2 Yes, you must use the Keras backend for this calculation because g is a tensor object and not a NumPy array. However, I believe the correct calculation for H should be H = K.dot(K.transpose(g), g), which takes the 1xN row vector g and performs an outer product to produce an NxN matrix.
Q3 This question is more about the algorithm itself.
A3 As stated in A1 I am not sure that keras supports the required inputs for this algorithm.
Q4 Do I need to change in any way local file optimizers.py?
A4 The provided line of code would run the optimizer if supplied as the optimizer argument to the model compile function of keras. The keras library supports calling the built in classes and functions by name for convenience.
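For reference, the textbook Levenberg-Marquardt update for a least-squares problem can be sketched outside Keras in plain NumPy. It needs the residual vector and its Jacobian, which is exactly the information A1 says a Keras optimizer does not receive. The toy line-fitting problem, the name lm_step and the fixed damping value are assumptions made for illustration; a real implementation would adapt the damping between steps:

```python
import numpy as np

def lm_step(w, residual_fn, jacobian_fn, damping):
    """One Levenberg-Marquardt update: w - (J^T J + damping * I)^-1 J^T r."""
    r = residual_fn(w)   # residuals f(x; w) - y, shape (n_samples,)
    J = jacobian_fn(w)   # Jacobian of the residuals, shape (n_samples, n_params)
    H = J.T @ J + damping * np.eye(w.size)  # damped Gauss-Newton approximation
    return w - np.linalg.solve(H, J.T @ r)

# Toy problem (made up for the example): fit y = a * x + b
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x + 0.5

def residual(w):
    return w[0] * x + w[1] - y

def jacobian(w):
    # d(residual)/da = x, d(residual)/db = 1
    return np.stack([x, np.ones_like(x)], axis=1)

w = np.zeros(2)
for _ in range(20):
    w = lm_step(w, residual, jacobian, damping=1e-3)
print(w)  # should end up close to [3.0, 0.5]
```

Solving the linear system with np.linalg.solve avoids forming an explicit inverse; note that an element-wise power such as K.pow(H, -1) is not a matrix inverse.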

How does `tf.keras.layers.ActivityRegularization` work and how to use it correctly?

When training a deep neural network, how can tf.keras.layers.ActivityRegularization be used to regularize the output?
In my code, the output has very large values, so I tried to regularize it. For my last dense layer, I tried:
output = tf.layers.dense(inputs=dropout_dense1,
                         units=NUM_OUTPUTS,
                         kernel_initializer=tf.truncated_normal_initializer,
                         activity_regularizer=tf.keras.layers.ActivityRegularization())
But no regularization effect is observed in the output (it is still very large). I tried all kinds of parameter combinations (the default is tf.keras.layers.ActivityRegularization(l1=0.0, l2=0.0)), but it doesn't seem to have any effect.

In your case, I think the proper method is something like this
(TensorFlow version >= 2):
output = tf.keras.layers.Dense(units=NUM_OUTPUTS,
                               kernel_initializer=tf.keras.initializers.TruncatedNormal(mean=0., stddev=1.),
                               activity_regularizer=tf.keras.regularizers.L2(0.01))(dropout_dense1)
You can change the method, for example from L2 to L1, or write a custom regularizer if you want to calculate the penalty your own way.
Please see the example under "Developing new regularizers" in the TensorFlow 2 documentation.
If you want to use tf.keras.layers.ActivityRegularization instead, you can use it as follows:
output = tf.keras.layers.Dense(units=NUM_OUTPUTS)(dropout_dense1)
output_reg = tf.keras.layers.Activation('relu')(output)
# Define ActivityRegularization layer
reg_output = tf.keras.layers.ActivityRegularization(l1=0.001, l2=0.001)
# Apply the ActivityRegularization layer directly to the (weight-free) layer output you want to regularize
output_reg = reg_output(output_reg)
other_layer = tf.keras.layers.Dense(units=NUM)(output_reg)
final_output = tf.keras.layers.Activation('relu')(other_layer)
model = tf.keras.Model(input, final_output)

How to add L1-regularization to one hidden layer?

There have been some answers about adding L1-regularization to the weights of one hidden layer. However, what I want is not only sparseness of the weights, but also sparseness of the representation of one hidden layer. What I want is something like the code below. Is this feasible, or can I only add L1-regularization on the weights?
import tensorflow as tf
...
HIDDEN = tf.contrib.layers.dense(input_layer, n_nodes)
...
loss = meansq  # or other loss calculation
l1_regularizer = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
regularization_penalty = tf.contrib.layers.apply_regularization(l1_regularizer, HIDDEN)
regularized_loss = loss + regularization_penalty
This idea is from the sparse representation section of the book Deep Learning written by Goodfellow and Bengio.

If you are using tf.contrib.layers, the fully_connected function accepts a weights_regularizer argument, so your code should look like this:
l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
That said, tf.contrib.layers has been mostly moved to the core API, so you should be using tf.layers.dense instead with kernel_regularizer argument.
The code above will regularize the weights in the layer. If you want to regularize both weights and the layer output, you can use the same tf.contrib.layers.l1_regularizer or create a different one with different parameters. Something like this should work for you:
l1 = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
hidden = tf.contrib.layers.fully_connected(inputs, n_nodes, weights_regularizer=l1)
hidden_reg = l1(hidden)
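To show what these penalties amount to numerically, here is a small NumPy sketch. An L1 regularizer with a given scale returns scale * sum(|values|), and the result still has to be added to the loss. The array shapes, the ReLU layer and the mean-squared-error loss are assumptions chosen for the illustration:

```python
import numpy as np

def l1_penalty(values, scale):
    """L1 penalty: scale * sum of absolute values."""
    return scale * np.sum(np.abs(values))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 10))    # batch of 4 examples (made-up data)
weights = rng.normal(size=(10, 3))   # weights of the hidden layer
targets = rng.normal(size=(4, 3))

hidden = np.maximum(inputs @ weights, 0.0)   # hidden representation (ReLU)
loss = np.mean((hidden - targets) ** 2)      # stand-in for the task loss

weight_reg = l1_penalty(weights, scale=0.005)  # encourages sparse weights
hidden_reg = l1_penalty(hidden, scale=0.005)   # encourages sparse activations
regularized_loss = loss + weight_reg + hidden_reg
print(loss, weight_reg, hidden_reg, regularized_loss)
```

In the TensorFlow version, the equivalent step would be adding hidden_reg (and the collected weight regularization losses) to the loss before passing it to the optimizer.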
