keras implementation of Levenberg-Marquardt optimization algorithm as a custom optimizer - python

I am trying to implement the Levenberg-Marquardt algorithm as a Keras optimizer, as described here, but I have several problems. The biggest one is this error:
TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn.
After a quick search I found out this is connected to how TensorFlow runs programs with graphs, which I don't understand in detail. I found this answer from SO useful, but it is about the loss function, not the optimizer.
So to the point.
My attempt looks like this:
from keras.optimizers import Optimizer
from keras.legacy import interfaces
from keras import backend as K
class Leveberg_Marquardt(Optimizer):
    def __init__(self, tau=1e-2, lambda_1=1e-5, lambda_2=1e+2, **kwargs):
        super(Leveberg_Marquardt, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.tau = K.variable(tau, name='tau')
            self.lambda_1 = K.variable(lambda_1, name='lambda_1')
            self.lambda_2 = K.variable(lambda_2, name='lambda_2')

    #interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]
        error = [K.int_shape(m) for m in loss]
        for p, g, err in zip(params, grads, error):
            H = K.dot(g, K.transpose(g)) + self.tau * K.eye(K.max(g))
            w = p - K.pow(H, -1) * K.dot(K.transpose(g), err)  # ended at step 3 from http://mads.lanl.gov/presentations/Leif_LM_presentation_m.pdf
            if self.tau > self.lambda_2:
                w = w - 1 / self.tau * err
            if self.tau < self.lambda_1:
                w = w - K.pow(H, -1) * err
            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                w = p.constraint(w)
            self.updates.append(K.update_add(err, w))
        return self.updates

    def get_config(self):
        config = {'tau': float(K.get_value(self.tau)),
                  'lambda_1': float(K.get_value(self.lambda_1)),
                  'lambda_2': float(K.get_value(self.lambda_2))}
        base_config = super(Leveberg_Marquardt, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
Q1 Can I fix this error without going deep into TensorFlow? (I wish I could do this by staying at the Keras level.)
Q2 Do I use the Keras backend in the correct way?
I mean, in this line
H = K.dot(g, K.transpose(g)) + self.tau * K.eye(K.max(g))
should I use a Keras backend function, or NumPy, or pure Python, in order to run this code without problems, given that the input data are NumPy arrays?
Q3 This question is more about the algorithm itself.
Do I even implement LMA correctly? I must say I'm not sure how to deal with the boundary conditions; the tau/lambda values I have just guessed. Maybe you know a better way?
I was trying to understand how every other optimizer in Keras works, but even the SGD code looks ambiguous to me.
Q4 Do I need to change the local file optimizers.py in any way?
In order to run it properly I was initializing my optimizer with:
myOpt = Leveberg_Marquardt()
and then simply pass it to the compile method. Yet after a quick look at the source code of optimizers.py I found there are places in the code with explicitly written names of optimizers (e.g. the deserialize function). Is it important to extend this for my custom optimizer, or can I leave it be?
I would really appreciate any help and direction for future actions.

Q1 Can I fix this error without going deep into TensorFlow? (I wish I could do this by staying at the Keras level.)
A1 I believe that even if this error were fixed, there are still things the algorithm needs that Keras does not support; for example, the error term f(x;w_0)-y from the document is not available to a Keras optimizer.
Q2 Do I use the Keras backend in the correct way?
A2 Yes, you must use the Keras backend for this calculation, because g is a tensor object and not a NumPy array. However, I believe the correct calculation for H should be H = K.dot(K.transpose(g), g), which takes the gradient g and forms the N x N matrix g^T g.
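For concreteness, that line of get_updates would then read something like the following sketch (the size passed to K.eye, taken from the last dimension of g, is an assumption on my part and not something from the original attempt):
H = K.dot(K.transpose(g), g) + self.tau * K.eye(K.int_shape(g)[-1])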
Q3 This question is more about the algorithm itself.
A3 As stated in A1, I am not sure that Keras supports the required inputs for this algorithm.
Q4 Do I need to change the local file optimizers.py in any way?
A4 The provided line of code will run the optimizer if the instance is supplied as the optimizer argument to Keras's model compile function. The Keras library supports calling the built-in classes and functions by name purely as a convenience, so you do not need to register your custom optimizer there.
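A minimal usage sketch (the toy model here is purely illustrative):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(10, input_dim=4, activation='relu'), Dense(1)])
myOpt = Leveberg_Marquardt()                  # pass the instance itself,
model.compile(optimizer=myOpt, loss='mse')    # not a string name, so optimizers.py needs no changes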

Related

How to migrate tf.compat.v1.variable_scope() to TF2

I need to convert the following model to TF2. get_outputs() applies several convolution layers, then flatten and dense layers, and produces some outputs.
with tf.compat.v1.variable_scope(scope_name, reuse=tf.compat.v1.AUTO_REUSE):
    outputs = get_outputs()
The following lines apply an exponential moving average to the trainable variables.
params = tf.compat.v1.trainable_variables(scope_name)
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
    return ema.average(getter(*args, **kwargs))
This is an identical model which outputs the same outputs after somehow using the custom getter defined above.
with tf.compat.v1.variable_scope(
    scope_name, reuse=True, custom_getter=custom_getter
):
    avg_outputs = get_outputs()
And after each training iteration:
with tf.control_dependencies([opt_op]):  # opt_op being the result of optimizer.apply_gradients()
    grouped = tf.group(ema_apply_op)
And then grouped is run in the default session. I need some way to convert this to TF2. I tried the following, and I'm not sure whether it works or not; for some reason the model translated to TF2 (Keras) does not converge, and I'm trying to find out why by eliminating potential sources of the issue.
model = create_keras_model() # Keras model, results in the same outputs by TF1 model.
avg_model = create_keras_model() # Identical model
And after each training iteration, I do the following:
avg_weights = []
for w1, w2 in zip(model.trainable_variables, avg_model.trainable_variables):
    avg_weights.append(alpha * w2 + (1 - decay) * w1)
avg_model.set_weights(avg_weights)
Due to the model convergence issues, I cannot determine whether this works or not. I also tried replicating the same steps using tf.train.ExponentialMovingAverage; however, it doesn't do anything in TF2. When ema.apply() and ema.average() are called on the trainable variables, nothing happens: their values remain unchanged.

Training with parametric partial derivatives in pytorch

Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the neural network's output w.r.t. x, so that I can use the result when training the weights theta using a loss depending both on the output and on the partial derivatives of the output. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within PyTorch seems like a huge amount of work and a recipe for very slow code.
Thus, I tried something different, which failed. As a minimal example, I created a function (substituting for my neural network)
theta = torch.ones([3], requires_grad=True, dtype=torch.float32)

def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time
Then, I defined a second function to give me partial derivatives:
def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient
Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as regularization, to avoid complicated loss functions that are beside the point.
from torch.optim import Adam

def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)

optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()
Unfortunately, when running this code, I get the error
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I suppose that when calculating the partial derivatives the way I do, I do not really create a computational graph through which autodiff could differentiate. Thus, the connection to the parameters theta somehow gets lost, and it now looks to the optimizer as if the loss is completely independent of the parameters theta. However, I could be totally wrong.
Does anyone know how to fix this?
Is it possible to include this type of derivatives in the loss function in pytorch?
And if so, what would be the most pytorch-style way of doing this?
Many thanks for your help and advice, it is much appreciated.
For completeness:
To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:
true_a = 1
true_b = 1
true_c = 1

def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time

def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)

data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = torch.tensor(true_targets) + torch.randn_like(true_targets)*0.1
Your approach to the problem appears overly complicated.
I believe that what you're trying to achieve is within reach in PyTorch.
I include here a simple code snippet that I believe showcases what you would like to do:
import torch
import torch.nn as nn
# Data and Function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10 # batchsize
simple_function = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, 1)
x = torch.randn(n, output_dim)
t.requires_grad = True
# Actual computation
xhat = simple_function(t)
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
# jac has shape (n, output_dim, n, 1); the indexing below picks out the derivative of
# each sample's output w.r.t. its own time input, giving grad of shape (n, output_dim)
grad = jac[torch.arange(n), :, torch.arange(n), 0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()

Optimizing a function involving tf.keras's "model.predict()" using TensorFlow optimizers?

I used tf.keras to build a fully-connected ANN, "my_model". Then, I'm trying to minimize a function f(x) = my_model.predict(x) - 0.5 + g(x) using the Adam optimizer from TensorFlow. I tried the code below:
x = tf.get_variable('x', initializer=np.array([1.5, 2.6]))
f = my_model.predict(x) - 0.5 + g(x)
optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(f)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(50):
        print(sess.run([x, f]))
        sess.run(optimizer)
However, I'm getting the following error when my_model.predict(x) is executed:
If your data is in the form of symbolic tensors, you should specify the steps argument (instead of the batch_size argument)
I understand what the error is but I'm unable to figure out how to make my_model.predict(x) work in the presence of symbolic tensors. If my_model.predict(x) is removed from the function f(x), the code runs without any error.
I checked the following link, where TensorFlow optimizers are used to minimize an arbitrary function, but I think my problem is with the usage of the underlying Keras model.predict() function. I appreciate any help. Thanks in advance!
I found the answer!
Basically, I was trying to optimize a function involving a trained ANN w.r.t. the input variables to the ANN. So, all I wanted was to know how to call my_model and put it in f(x). Digging a bit into the Keras documentation here: https://keras.io/getting-started/functional-api-guide/, I found that all Keras models are callable, just like the layers of the models! Quoting the information from the link,
...you can treat any model as if it were a layer, by calling it on a tensor. Note that by calling a model you aren't just reusing the architecture of the model, you are also reusing its weights.
Meanwhile, the model.predict(x) part expects x to be numpy arrays or evaluated tensors and does not take tensorflow variables as inputs (https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict).
So the following code worked:
## initializations
sess = tf.InteractiveSession()
x_init_value = np.array([1.5, 2.6])
x_placeholder = tf.placeholder(tf.float32)
x_var = tf.Variable(x_init_value, dtype=tf.float32)
# Check calling my_model
assign_step = tf.assign(x_var, x_placeholder)
sess.run(assign_step, feed_dict={x_placeholder: x_init_value})
model_output = my_model(x_var)  # This simple step is all I wanted!
sess.run(model_output)  # This outputs my_model's predicted value for input x_init_value
# Now, define the objective function that has to be minimized
f = my_model(x_var) - 0.5 + g(x_var)  # g(x_var) is some function of x_var
# Define the optimizer step, updating only x_var
optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(f, var_list=[x_var])
# Run the optimization steps
for i in range(50):  # for 50 steps
    _, loss = sess.run([optimizer, f])
    print("step: ", i+1, ", loss: ", loss, ", X: ", x_var.eval())

Compute gradients for each time step of tf.while_loop

Given a TensorFlow tf.while_loop, how can I calculate the gradient of x_out with respect to all weights of the network for each time step?
network_input = tf.placeholder(tf.float32, [None])
steps = tf.constant(0.0)
weight_0 = tf.Variable(1.0)
layer_1 = network_input * weight_0

def condition(steps, x):
    return steps <= 5

def loop(steps, x_in):
    weight_1 = tf.Variable(1.0)
    x_out = x_in * weight_1
    steps += 1
    return [steps, x_out]

_, x_final = tf.while_loop(
    condition,
    loop,
    [steps, layer_1]
)
Some notes
In my network the condition is dynamic. Different runs are going to run the while loop a different number of times.
Calling tf.gradients(x, tf.trainable_variables()) crashes with AttributeError: 'WhileContext' object has no attribute 'pred'. It seems like the only possibility to use tf.gradients within the loop is to calculate the gradient with respect to weight_1 and the current value of x_in / time step only without backpropagating through time.
In each time step, the network is going to output a probability distribution over actions. The gradients are then needed for a policy gradient implementation.
You can't ever call tf.gradients inside tf.while_loop in TensorFlow, based on this and this. I found this out the hard way when I was trying to implement conjugate gradient descent entirely inside the TensorFlow graph.
But if I understand your model correctly, you could make your own version of an RNNCell and wrap it in a tf.dynamic_rnn; the actual cell implementation will be a little complex, though, since you need to evaluate a condition dynamically at runtime (a rough sketch follows below).
For starters, you can take a look at Tensorflow's dynamic_rnn code here.
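If the dynamic condition can be bounded by a fixed number of steps, a rough sketch of that idea might look like this (the cell, the fixed step count, and the assumption that network_input holds a single value are all illustrative, not taken from the question):
import tensorflow as tf

class MultiplyCell(tf.nn.rnn_cell.RNNCell):
    """Re-applies x_out = x_in * weight_1 at every step, carrying x in the state."""

    @property
    def state_size(self):
        return 1

    @property
    def output_size(self):
        return 1

    def build(self, inputs_shape):
        self.weight_1 = self.add_variable("weight_1", [], initializer=tf.ones_initializer())
        self.built = True

    def call(self, inputs, state):
        x_out = state * self.weight_1
        return x_out, x_out

cell = MultiplyCell()
dummy_inputs = tf.zeros([1, 5, 1])        # batch of 1, 5 fixed time steps; inputs only set the step count
init_state = tf.reshape(layer_1, [1, 1])  # assumes network_input holds a single value
outputs, _ = tf.nn.dynamic_rnn(cell, dummy_inputs, initial_state=init_state)
# outputs[:, t, :] is x after step t, so gradients are now defined per time step:
per_step_grads = [tf.gradients(outputs[:, t, :], tf.trainable_variables()) for t in range(5)]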
Alternatively, dynamic graphs have never been TensorFlow's strong suit, so consider using other frameworks like PyTorch, or you can try out eager_execution and see if that helps.

How to do Xavier initialization on TensorFlow

I'm porting my Caffe network over to TensorFlow but it doesn't seem to have xavier initialization. I'm using truncated_normal but this seems to be making it a lot harder to train.
Since version 0.8 there is a Xavier initializer, see here for the docs.
You can use something like this:
W = tf.get_variable("W", shape=[784, 256],
initializer=tf.contrib.layers.xavier_initializer())
Just to add another example on how to define a tf.Variable initialized using Xavier and Yoshua's method:
graph = tf.Graph()
with graph.as_default():
    ...
    initializer = tf.contrib.layers.xavier_initializer()
    w1 = tf.Variable(initializer(w1_shape))
    b1 = tf.Variable(initializer(b1_shape))
    ...
This prevented me from having NaN values in my loss function due to numerical instabilities when using multiple layers with ReLUs.
In TensorFlow 2.0 and later, both tf.contrib.* and tf.get_variable() are deprecated. In order to do Xavier initialization you now have to switch to:
init = tf.initializers.GlorotUniform()
var = tf.Variable(init(shape=shape))
# or a one-liner with slightly confusing brackets
var = tf.Variable(tf.initializers.GlorotUniform()(shape=shape))
Glorot uniform and Xavier uniform are two different names of the same initialization type. If you want to know more about how to use initializations in TF2.0 with or without Keras refer to documentation.
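For example, with Keras layers in TF2 you can pass the initializer object directly or refer to it by its string alias (the layer below is just illustrative):
import tensorflow as tf

# GlorotUniform (a.k.a. Xavier uniform) is also the default kernel initializer for Dense
# and Conv layers, so passing it explicitly is mainly useful for fixing a seed.
layer = tf.keras.layers.Dense(256, kernel_initializer=tf.keras.initializers.GlorotUniform(seed=1))
# equivalently: tf.keras.layers.Dense(256, kernel_initializer="glorot_uniform")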
@Aleph7, Xavier/Glorot initialization depends on the number of incoming connections (fan_in), the number of outgoing connections (fan_out), and the kind of activation function (sigmoid or tanh) of the neuron. See this: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
So now, to your question. This is how I would do it in TensorFlow:
(fan_in, fan_out) = ...
low = -4*np.sqrt(6.0/(fan_in + fan_out)) # use 4 for sigmoid, 1 for tanh activation
high = 4*np.sqrt(6.0/(fan_in + fan_out))
return tf.Variable(tf.random_uniform(shape, minval=low, maxval=high, dtype=tf.float32))
Note that we should be sampling from a uniform distribution, and not the normal distribution as suggested in the other answer.
Incidentally, I wrote a post yesterday for something different using TensorFlow that happens to also use Xavier initialization. If you're interested, there's also a python notebook with an end-to-end example: https://github.com/delip/blog-stuff/blob/master/tensorflow_ufp.ipynb
A nice wrapper around tensorflow called prettytensor gives an implementation in the source code (copied directly from here):
def xavier_init(n_inputs, n_outputs, uniform=True):
    """Set the parameter initialization using the method described.
    This method is designed to keep the scale of the gradients roughly the same
    in all layers.
    Xavier Glorot and Yoshua Bengio (2010):
        Understanding the difficulty of training deep feedforward neural
        networks. International conference on artificial intelligence and
        statistics.
    Args:
        n_inputs: The number of input nodes into each output.
        n_outputs: The number of output nodes for each input.
        uniform: If true use a uniform distribution, otherwise use a normal.
    Returns:
        An initializer.
    """
    if uniform:
        # 6 was used in the paper.
        init_range = math.sqrt(6.0 / (n_inputs + n_outputs))
        return tf.random_uniform_initializer(-init_range, init_range)
    else:
        # 3 gives us approximately the same limits as above since this repicks
        # values greater than 2 standard deviations from the mean.
        stddev = math.sqrt(3.0 / (n_inputs + n_outputs))
        return tf.truncated_normal_initializer(stddev=stddev)
TF-contrib has xavier_initializer. Here is an example of how to use it:
import tensorflow as tf
a = tf.get_variable("a", shape=[4, 4], initializer=tf.contrib.layers.xavier_initializer())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(a))
In addition to this, tensorflow has other initializers:
xavier_initializer_conv2d
variance_scaling_initializer
constant_initializer
zeros_initializer
ones_initializer
uniform_unit_scaling_initializer
truncated_normal_initializer
random_uniform_initializer
random_normal_initializer
orthogonal_initializer
as well as a lot of initializers from keras
I looked and I couldn't find anything built in. However, according to this:
http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
Xavier initialization is just sampling a (usually Gaussian) distribution where the variance is a function of the number of neurons. tf.random_normal can do that for you; you just need to compute the stddev from the number of neurons represented by the weight matrix you're trying to initialize.
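A minimal sketch of that recipe (the shape is illustrative, and the 2/(fan_in + fan_out) variance is the usual Glorot-normal choice rather than something taken from the post above):
import numpy as np
import tensorflow as tf

fan_in, fan_out = 784, 256
stddev = np.sqrt(2.0 / (fan_in + fan_out))  # variance scaled by the layer's fan-in and fan-out
W = tf.Variable(tf.random_normal([fan_in, fan_out], stddev=stddev))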
Via the kernel_initializer parameter to tf.layers.conv2d, tf.layers.conv2d_transpose, tf.layers.Dense etc
e.g.
layer = tf.layers.conv2d(
    input, 128, 5, strides=2, padding='SAME',
    kernel_initializer=tf.contrib.layers.xavier_initializer())
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d
https://www.tensorflow.org/api_docs/python/tf/layers/conv2d_transpose
https://www.tensorflow.org/api_docs/python/tf/layers/Dense
Just in case you want to use one line as you do with:
W = tf.Variable(tf.truncated_normal((n_prev, n), stddev=0.1))
You can do:
W = tf.Variable(tf.contrib.layers.xavier_initializer()((n_prev, n)))
Tensorflow 1:
W1 = tf.get_variable("W1", [25, 12288],
                     initializer=tf.contrib.layers.xavier_initializer(seed=1))
Tensorflow 2:
W1 = tf.get_variable("W1", [25, 12288],
                     initializer=tf.random_normal_initializer(seed=1))
