The documentation for tf.train.MomentumOptimizer offers a use_nesterov parameter to utilise Nesterov's Accelerated Gradient (NAG) method.
However, NAG requires the gradient at a location other than that of the current variable to be calculated, and the apply_gradients interface only allows for the current gradient to be passed. So I don't quite understand how the NAG algorithm could be implemented with this interface.
The documentation says the following about the implementation:
use_nesterov: If True use Nesterov Momentum. See Sutskever et al., 2013. This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.
Having read through the paper in the link, I'm a little unsure about whether this description answers my question or not. How can the NAG algorithm be implemented when the interface doesn't require a gradient function to be provided?
TL;DR
TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.
Details
This is a great question. In the paper, the NAG update is defined as
v_{t+1} = μ.v_t − λ.∇f(θ_t + μ.v_t)
θ_{t+1} = θ_t + v_{t+1}
where f is our cost function, θ_t our parameters at time t, μ the momentum, λ the learning rate; v_t is NAG's internal accumulator.
The main difference with standard momentum is the use of the gradient at θ_t + μ.v_t rather than at θ_t. But as you said, tensorflow only uses the gradient at θ_t. So what is the trick?
Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm tracks θ_t + μ.v_t, not θ_t. The other part comes from an approximation that is valid for large values of momentum.
Let's make a slight change of notation from the paper for the accumulator, to stick with tensorflow's definition. Let's define a_t = v_t / λ. The update rules are changed slightly as
a_{t+1} = μ.a_t − ∇f(θ_t + μ.λ.a_t)
θ_{t+1} = θ_t + λ.a_{t+1}
(The motivation for this change in TF is that now a is a pure gradient momentum, independent of the learning rate. This makes the update robust to changes in λ, which is common in practice but which the paper does not consider.)
If we note ψ_t = θ_t + μ.λ.a_t, then
a_{t+1} = μ.a_t − ∇f(ψ_t)
ψ_{t+1} = θ_{t+1} + μ.λ.a_{t+1}
        = θ_t + λ.a_{t+1} + μ.λ.a_{t+1}
        = ψ_t + λ.a_{t+1} + μ.λ.(a_{t+1} − a_t)
        = ψ_t + λ.a_{t+1} + μ.λ.[(μ − 1).a_t − ∇f(ψ_t)]
        ≈ ψ_t + λ.a_{t+1}
This last approximation holds for large values of momentum, where μ is close to 1, so that μ − 1 is close to zero, and ∇f(ψ_t) is small compared to a_t — this second part of the approximation is more debatable, and less valid for directions whose gradient changes sign frequently.
We now have an update that uses the gradient of the current position, and the rules are pretty simple — they are in fact those of standard momentum.
However, we want θ_t, not ψ_t. This is why we subtract μ.λ.a_{t+1} from ψ_{t+1} just before returning it — and to recover ψ, it is added back first thing at the next call.
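To make this concrete, here is a small NumPy sketch (my own illustration, not TensorFlow's code) that runs the paper's NAG update and the plain-momentum-on-ψ approximation side by side on the quadratic f(θ) = 0.5·θ², recovering θ from ψ by subtracting μ.λ.a as described above:

import numpy as np

mu, lr = 0.9, 0.1                 # large momentum, where the approximation holds
grad = lambda x: x                # gradient of f(x) = 0.5 * x**2

theta, a = 1.0, 0.0               # exact NAG state (paper form, a_t = v_t / lr)
psi, b = 1.0, 0.0                 # approximate state: plain momentum on psi

for _ in range(50):
    # exact NAG: gradient taken at the lookahead point theta + mu*lr*a
    a = mu * a - grad(theta + mu * lr * a)
    theta = theta + lr * a
    # approximation: gradient taken at psi itself, standard momentum update
    b = mu * b - grad(psi)
    psi = psi + lr * b

# psi tracks theta + mu*lr*a, so subtracting mu*lr*b recovers an estimate of theta
print(theta, psi - mu * lr * b)   # the two values stay close for mu near 1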
I couldn't see any info on this online, and the linked paper certainly wasn't helpful, so I had a look at the unit tests for tf.train.MomentumOptimizer, from which I can see tests for the implementation of both classic momentum and NAG modes.
Summary
var = var + accum * learning_rate * momentum
accum = accum * momentum + g
var = var - learning_rate * accum
var = var - accum * learning_rate * momentum
where accum starts at 0 and is updated at every step. The above is a modified version of the formulation in the unit test, which I find a bit confusing. Here is the same set of equations rearranged with my interpretation of what each of the quantities represents (I could be wrong though):
average_grad_0 = accum # previous rolling average
average_grad_1 = accum * momentum + g # updated rolling average
grad_diff = average_grad_1 - average_grad_0
adjustment = -learning_rate * (grad_diff * momentum + average_grad_1)
var += adjustment
accum = average_grad_1
In other words, it seems to me that tensorflow's implementation attempts to guess the "adjusted gradient" in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient. I'd love to see a proof for this!
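As a quick numerical sanity check (my own sketch, not part of the TF test suite), the "adjusted gradient" interpretation above can be compared against the reference update used in the unit tests (reproduced further below); the two agree exactly:

import numpy as np

def reference_update(var, accum, g, lr, momentum):
    # update sequence from TF's _update_nesterov_momentum_numpy test helper
    var = var + accum * lr * momentum
    accum = accum * momentum + g
    var = var - lr * accum
    var = var - accum * lr * momentum
    return var, accum

def interpreted_update(var, accum, g, lr, momentum):
    average_grad_0 = accum                  # previous rolling average
    average_grad_1 = accum * momentum + g   # updated rolling average
    grad_diff = average_grad_1 - average_grad_0
    adjustment = -lr * (grad_diff * momentum + average_grad_1)
    return var + adjustment, average_grad_1

var_a = var_b = 1.0
accum_a = accum_b = 0.0
for g in [0.1, 0.3, -0.2, 0.05]:
    var_a, accum_a = reference_update(var_a, accum_a, g, 2.0, 0.9)
    var_b, accum_b = interpreted_update(var_b, accum_b, g, 2.0, 0.9)
    assert np.isclose(var_a, var_b) and np.isclose(accum_a, accum_b)
print("the two formulations agree")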
What follows is more detail on how the classic and nesterov modes are implemented in tensorflow as per the tests.
Classic Momentum mode
For use_nesterov=False, based on the doTestBasic function, we have the following initial parameters:
learning_rate = 2.0
momentum = 0.9
var_0 = 1.0 # at time 0
grad = 0.1
Actually, the above are just the first element of the grads_0 and vars_0 arrays, but I'll just focus on a single value. For the subsequent timesteps, we have
var_1 = 1.0 - (0.1 * 2.0)
var_2 = 1.0 - (0.1 * 2.0) - ((0.9 * 0.1 + 0.1) * 2.0)
which I'm going to interpret as meaning:
var_1 = var_0 - (grad * learning_rate)
var_2 = var_1 - ((momentum * grad + grad) * learning_rate)
If we assume that for the purposes of the unit tests grad_0 == grad_1 == grad then this makes sense as a formulation of classic momentum.
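Replaying the classic momentum recurrence with the test's constants (a small sketch of mine, not the test code itself) reproduces those values:

lr, momentum, grad = 2.0, 0.9, 0.1
var, accum = 1.0, 0.0
for step in range(2):
    accum = accum * momentum + grad
    var = var - lr * accum
    print(var)   # 0.8 after the first step, 0.42 after the second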
Nesterov's Accelerated Gradient (NAG) mode
For use_nesterov=True, I had a look at the _update_nesterov_momentum_numpy function and the testNesterovMomentum test case.
The _update_nesterov_momentum_numpy function has the following definition:
def _update_nesterov_momentum_numpy(self, var, accum, g, lr, momentum):
    var = var + accum * lr * momentum
    accum = accum * momentum + g
    var = var - lr * accum
    var = var - accum * lr * momentum
    return var, accum
and it is called in the unit tests like this:
for t in range(1, 5):
    opt_op.run()
    var0_np, accum0_np = self._update_nesterov_momentum_numpy(
        var0_np, accum0_np, var0_np * 10, 2.0, 0.9)
Related
I'm trying to implement the cyclical learning rate approach on top of the PyTorch reimplementation of StyleGAN by rosinality. To do so, I am building on what is suggested in this blog post.
To check how the loss changes as a function of the learning rate, I need to plot how the (LR, loss) changes. Here you can find my modified version of train.py. These are the main changes I made:
Define cyclical_lr, a function regulating the cyclical learning rate
def cyclical_lr(stepsize, min_lr, max_lr):
    # Scaler: we can adapt this if we do not want the triangular CLR
    scaler = lambda x: 1.

    # Lambda function to calculate the LR
    lr_lambda = lambda it: min_lr + (max_lr - min_lr) * relative(it, stepsize)

    # Additional function to see where on the cycle we are
    def relative(it, stepsize):
        cycle = math.floor(1 + it / (2 * stepsize))
        x = abs(it / stepsize - 2 * cycle + 1)
        return max(0, (1 - x)) * scaler(cycle)

    return lr_lambda
Implement the cyclical learning rate for both the discriminator and the generator
step_size = 5*256
end_lr = 10**-1
factor = 10**5
clr = cyclical_lr(step_size, min_lr=end_lr / factor, max_lr=end_lr)
scheduler_g = torch.optim.lr_scheduler.LambdaLR(g_optimizer, [clr, clr])
d_optimizer = optim.Adam(discriminator.parameters(), lr=args.lr, betas=(0.0, 0.99))
scheduler_d = torch.optim.lr_scheduler.LambdaLR(d_optimizer, [clr])
Do you have suggestions on how to plot how the loss changes as a function of the learning rate? Ideally, I would like to do it using TensorBoard, where for now I am plotting the generator loss, the discriminator loss and the size of the generated images as a function of the iteration number:
if (i + 1) % 100 == 0:
    writer.add_scalar('Loss/G', gen_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Loss/D', disc_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Step/pixel_size', (4 * 2 ** step), i * args.batch.get(resolution))
    print(args.batch.get(resolution))
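One option (a sketch of mine that reuses writer, i, args, resolution, and scheduler_g from the snippets above, so treat the exact calls as assumptions) is to log the current learning rate under its own scalar tag, so loss and LR can be paired by step and plotted against each other:

if (i + 1) % 100 == 0:
    # current LR of the first parameter group; older PyTorch versions expose
    # scheduler_g.get_lr() instead of get_last_lr()
    current_lr = scheduler_g.get_last_lr()[0]
    step_idx = i * args.batch.get(resolution)
    writer.add_scalar('LR/G', current_lr, step_idx)
    # exporting the LR and loss scalars from TensorBoard (or pairing them in
    # code) then gives the (LR, loss) points needed for the LR-range plot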
I was working on price prediction with the data set provided in this link, the imports-85.data.
With horsepower, curb-weight, engine-size and highway-mpg as features, I tried to normalize the data (because the cost was very large) and run the gradient descent algorithm by implementing the following:
Initialization
data = df[attrs]
m = len(data)   # number of training examples
f = len(attrs)  # number of features
X = np.hstack((np.ones(shape=(m,1)),np.array(data)))
T = np.zeros(f + 1) # Coefficients of x(0),x(1),...x(n)
norm_price = df.price / 1000
Y = np.array(norm_price)
# Normalization
data['curb-weight'] = (data['curb-weight'] * 0.453592) / 1000 # To kg (e-1000)
data['highway-mpg'] = data['highway-mpg'] * 0.425144 # To km per litre (kml)
data['engine-size'] = data['engine-size'] / 100 # To e-100
data['horsepower'] = data['horsepower'] / 100 # To e-100
col_rename = {
'curb-weight':'curb-weight-kg(e-1000)',
'highway-mpg':'highway-kml',
'engine-size':'engine-size(e-100)',
'horsepower':'horsepower(e-100)'
}
data.rename(columns=col_rename,inplace=True)
Cost calculation
def calculateCost():
    global m, T, X
    hypot = (X.dot(T) - Y).transpose().dot(X.dot(T) - Y)
    return hypot / (2 * m)
Gradient descent
def gradDescent(threshold, iter=10000, alpha=3e-8):
    global T, X, Y, m
    i = 0
    cost = calculateCost()
    cost_hist = [cost]
    while i < iter:
        T = T - (alpha / m) * X.transpose().dot(X.dot(T) - Y)
        cost = calculateCost()
        cost_hist.append(cost)
        i += 1
        if cost <= threshold:
            return cost_hist
    return cost_hist  # also return the history if the threshold is never reached
I ran the gradient descent with this implementation:
Batch Gradient Descent
Without normalization, the cost would be 118634960.460199.
With normalization, the cost would be 118.634960460199
As a result, I have a few questions:
Is my normalization technique correct?
After normalization, the cost would be different. How do I set the threshold for the cost after normalization?
I think you may be misunderstanding 'normalization' in the context of machine learning. From my reading of your code, your 'normalization' section is doing unit conversions. Prior to gradient descent, it is common to apply min-max scaling or standard scaling; see the scikit-learn user guide. These techniques give the features a consistent scale, so that changes in a single feature do not completely dominate the loss function. This question and this blog post have a longer discussion.
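For illustration, here is a minimal sketch (mine, using the variable names from the question) of what that scaling step could look like with scikit-learn, applied before building X:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

features = np.array(data)                                  # features in their original units
X_scaled = StandardScaler().fit_transform(features)        # zero mean, unit variance
# or: X_scaled = MinMaxScaler().fit_transform(features)    # rescaled to [0, 1]
X = np.hstack((np.ones((len(X_scaled), 1)), X_scaled))     # re-attach the bias column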
I am trying to implement a Scikit-learn compatible class for a geometric distribution regression model. However, I am getting an overflow runtime error in my objective function which has an exp component. I am also having a similar error in my objective_grad function which finds the gradient at every step.
The code for my objective function which calculates the objective value using the parameters is given below:
weight = wb[0:-1]
bias = wb[-1]
theta = X.dot(weight) + bias
obj = y.dot(theta)
obj += np.sum((y + 1).dot((np.log(1 + np.exp(-theta)))))
obj += self.lam * np.sum(np.square(weight))
return obj
I am trying to use the log-sum-exp trick to solve this issue, but I am not sure how to apply it to this component: np.log(1 + np.exp(-theta))
I have also shared the code for my gradient function, which has a similar overflow issue:
weight = wb[:-1]
bias = wb[-1]
theta = X.dot(weight) + bias
common_grad = y / (1 + np.exp(-theta)) - 1 / (1 + np.exp(theta))
dw = X.T.dot(common_grad) - 2 * self.lam * weight
db = -np.sum(common_grad) - 2 * self.lam * bias
dwb = np.hstack((dw, db))
return dwb
What should be my approach to handle the exp problem in this scenario?
I found the issue. In this line,
obj += np.sum((y + 1).dot((np.log(1 + np.exp(-theta))))),
np.log(1 + np.exp(-theta)) was exploding.
I resolved it by using
-np.log(1 / (1 + np.exp(theta)))
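An alternative that avoids the overflow (my suggestion, not part of the original answer) is to evaluate log(1 + exp(-theta)) with np.logaddexp, which is essentially the log-sum-exp trick mentioned in the question:

import numpy as np

theta = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])

stable = np.logaddexp(0, -theta)     # log(exp(0) + exp(-theta)), never overflows
naive = np.log(1 + np.exp(-theta))   # overflows (RuntimeWarning) for theta = -1000

print(stable)
print(naive)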
I have a simple cost function, which I want to optimize using scipy.optimize.minimize function.
opt_solution = scipy.optimize.minimize(costFunction, theta, args=(training_data,), method='L-BFGS-B', jac=True, options={'maxiter': 100})
where costFunction is the function to be optimized and theta are the parameters to be optimized. Inside costFunction, I print the value of the cost function. But the parameter maxiter seems to have no effect: whether I increase its value from 10 to 100000, the time taken is the same. Also, I was expecting the number of printed cost values to equal maxiter. So it feels like maxiter has no effect. What might be the problem?
Cost function is
def costFunction(self, theta, input):

    """ Extract weights and biases from 'theta' input """
    W1 = theta[self.limit0 : self.limit1].reshape(self.hidden_size, self.visible_size)
    W2 = theta[self.limit1 : self.limit2].reshape(self.visible_size, self.hidden_size)
    b1 = theta[self.limit2 : self.limit3].reshape(self.hidden_size, 1)
    b2 = theta[self.limit3 : self.limit4].reshape(self.visible_size, 1)

    """ Compute output layers by performing a feedforward pass
        Computation is done for all the training inputs simultaneously """
    hidden_layer = self.sigmoid(numpy.dot(W1, input) + b1)
    output_layer = self.sigmoid(numpy.dot(W2, hidden_layer) + b2)

    """ Compute intermediate difference values using Backpropagation algorithm """
    diff = output_layer - input
    sum_of_squares_error = 0.5 * numpy.sum(numpy.multiply(diff, diff)) / input.shape[1]
    weight_decay = 0.5 * self.lamda * (numpy.sum(numpy.multiply(W1, W1)) + numpy.sum(numpy.multiply(W2, W2)))
    cost = sum_of_squares_error + weight_decay

    """ Compute the gradient values by averaging partial derivatives
        Partial derivatives are averaged over all training examples """
    W1_grad = numpy.dot(del_hid, numpy.transpose(input))
    W2_grad = numpy.dot(del_out, numpy.transpose(hidden_layer))
    b1_grad = numpy.sum(del_hid, axis = 1)
    b2_grad = numpy.sum(del_out, axis = 1)

    W1_grad = W1_grad / input.shape[1] + self.lamda * W1
    W2_grad = W2_grad / input.shape[1] + self.lamda * W2
    b1_grad = b1_grad / input.shape[1]
    b2_grad = b2_grad / input.shape[1]

    """ Transform numpy matrices into arrays """
    W1_grad = numpy.array(W1_grad)
    W2_grad = numpy.array(W2_grad)
    b1_grad = numpy.array(b1_grad)
    b2_grad = numpy.array(b2_grad)

    """ Unroll the gradient values and return as 'theta' gradient """
    theta_grad = numpy.concatenate((W1_grad.flatten(), W2_grad.flatten(),
                                    b1_grad.flatten(), b2_grad.flatten()))

    # Update counter value
    self.counter += 1
    print "Index ", self.counter, "cost ", cost

    return [cost, theta_grad]
maxiter gives the maximum number of iterations that scipy will try before giving up on improving the solution. But it may very well be satisfied with a solution and stop earlier.
If you look at the docs for minimize when using the 'l-bfgs-b' method, notice there are three parameters you can pass as options (factr, ftol and gtol) that can also cause the iteration to stop.
In simple cases like yours, especially if your cost function also provides the gradient (as indicated by jac=True in your call), convergence typically happens in the first few iterations, hence way before the maxiter limit is reached.
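To see this directly, here is a small sketch (mine, not from the original post) that counts the iterations actually performed and prints why the solver stopped, using scipy's built-in Rosenbrock function as a stand-in for the autoencoder cost:

import numpy as np
import scipy.optimize

def cost_with_grad(x):
    # returns (cost, gradient), matching jac=True as in the question
    return scipy.optimize.rosen(x), scipy.optimize.rosen_der(x)

iteration_count = []
result = scipy.optimize.minimize(
    cost_with_grad, np.zeros(5), method='L-BFGS-B', jac=True,
    callback=lambda xk: iteration_count.append(1),   # called once per iteration
    options={'maxiter': 100000, 'ftol': 1e-15, 'gtol': 1e-12})

# result.nit is usually far below maxiter; result.message explains the stop,
# and result.nfev (function evaluations) is what the cost-function print counts
print(len(iteration_count), result.nit, result.nfev, result.message)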
I am looking for a method that chooses the weights that minimize the portfolio variance.
For example:
I have 3 assets; their returns are given in an array below:
import numpy as np
x = np.array([[0.2,-0.1,0.5,-0.2],[0, -0.9, 0.8, 0.2],[0.4,0.5,-0.3,-.01]])
I can weight them however I want, as long as their weights sum to 1. I am looking for the weights that minimize the variance of the portfolio.
Here are two examples of randomly chosen weights:
weight_1 = [0.3,0.3,0.4]
weighted_x_1 = [ele_x*ele_w for ele_x,ele_w in zip (x,weight_1)]
var_1 = np.var(sum(weighted_x_1))
weight_2 = [-0.2,0.4,0.8]
weighted_x_2 = [ele_x*ele_w for ele_x,ele_w in zip (x,weight_2)]
var_2 = np.var(sum(weighted_x_2))
Output:
>>> var_1
0.02351675000000001
>>> var_2
0.012071999999999999
The second way is better.
Is there a Python (or a Python library) method that could do this for me? If not any suggestions on what method I should use to do the above are welcome.
Please see the accepted answer to this question: Finance Lib with portfolio optimization method in python
Relevant bit is here:
Here's a quote from a post I found.
Some research says that "mean variance portfolio optimization" can give good results. I discussed this in a message.
To implement this approach, a needed input is the covariance matrix of returns, which requires historical stock prices, which one can obtain using "Python quote grabber" http://www.openvest.org/Databases/ovpyq .
For expected returns -- hmmm. One of the papers I cited found that assuming equal expected returns of all stocks can give reasonable results.
Then one needs a "quadratic programming" solver, which appears to be handled by the CVXOPT Python package.
If someone implements the approach in Python, I'd be happy to hear about it.
There is a "backtest" package in R (open source stats package callable from Python) http://cran.r-project.org/web/packages/backtest/index.html "for exploring portfolio-based hypotheses about financial instruments (stocks, bonds, swaps, options, et cetera)."
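As a rough sketch of that idea in Python (my own, not from the quoted post), the minimum-variance weights for the question's return matrix can be found with scipy.optimize.minimize and a sum-to-one constraint (weights are allowed to be negative, as in the question):

import numpy as np
from scipy.optimize import minimize

x = np.array([[0.2, -0.1, 0.5, -0.2],
              [0.0, -0.9, 0.8, 0.2],
              [0.4, 0.5, -0.3, -0.01]])   # one row of returns per asset

cov = np.cov(x, bias=True)                # matches np.var's population variance

def portfolio_variance(w):
    return w @ cov @ w

w0 = np.full(len(x), 1.0 / len(x))        # start from equal weights
res = minimize(portfolio_variance, w0,
               constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}])

print(res.x, portfolio_variance(res.x))   # weights summing to 1 and their variance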
My full solution can be viewed in the PDF.
The trick is to put the vectors x_i as the columns of a matrix X.
The problem then becomes a convex problem, with the constraint that the solution lies on the unit simplex.
I solved it using the Projected Sub Gradient method: I calculated the gradient of the objective function and created a projection onto the unit simplex, and then all that is needed is to iterate them.
I validated my solution using CVX.
% StackOverflow 44984132
% How to calculate weight to minimize variance?
% Remarks:
% 1. sa
% TODO:
% 1. ds
% Release Notes
% - 1.0.000 08/07/2017
% * First release.
%% General Parameters
run('InitScript.m');
figureIdx = 0; %<! Continue from Question 1
figureCounterSpec = '%04d';
generateFigures = OFF;
%% Simulation Parameters
dimOrder = 3;
numSamples = 4;
mX = randi([1, 10], [dimOrder, numSamples]);
vE = ones([dimOrder, 1]);
%% Solve Using CVX
cvx_begin('quiet')
cvx_precision('best');
variable vW(numSamples)
minimize( (0.5 * sum_square_abs( mX * vW - (1 / numSamples) * (vE.' * mX * vW) * vE )) )
subject to
sum(vW) == 1;
vW >= 0;
cvx_end
disp([' ']);
disp(['CVX Solution - [ ', num2str(vW.'), ' ]']);
%% Solve Using Projected Sub Gradient
numIterations = 20000;
stepSize = 0.001;
simplexRadius = 1; %<! Unit Simplex Radius
stopThr = 1e-6;
hKernelFun = @(vW) ((mX * vW) - ((1 / numSamples) * ((vE.' * mX * vW) * vE)));
hObjFun = @(vW) 0.5 * sum(hKernelFun(vW) .^ 2);
hGradFun = @(vW) (mX.' * hKernelFun(vW)) - ((1 / numSamples) * vE.' * (hKernelFun(vW)) * mX.' * vE);
vW = rand([numSamples, 1]);
vW = vW(:) / sum(vW);
for ii = 1:numIterations
vGradW = hGradFun(vW);
vW = vW - (stepSize * vGradW);
% Projecting onto the Unit Simplex
% sum(vW) == 1, vW >= 0.
vW = ProjectSimplex(vW, simplexRadius, stopThr);
end
disp([' ']);
disp(['Projected Sub Gradient Solution - [ ', num2str(vW.'), ' ]']);
%% Restore Defaults
% set(0, 'DefaultFigureWindowStyle', 'normal');
% set(0, 'DefaultAxesLooseInset', defaultLoosInset);
You can see the full code in StackOverflow Q44984132 Repository (PDF is available as well).
The solution was taken from StackOverflow Q44984132.
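For readers who want to stay in Python, here is a rough sketch of the same projected sub-gradient idea (my own translation; the projection helper mirrors what ProjectSimplex does in the MATLAB script, and the step size and iteration count are just illustrative):

import numpy as np

def project_simplex(v, radius=1.0):
    # Euclidean projection of v onto {w : sum(w) = radius, w >= 0}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (radius - css) / idx > 0)[0][-1]
    theta = (radius - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def min_variance_weights(returns, step_size=0.001, num_iterations=20000):
    # projected (sub)gradient descent on w -> Var(w @ returns), with w on the simplex
    cov = np.cov(returns, bias=True)
    w = project_simplex(np.random.rand(returns.shape[0]))
    for _ in range(num_iterations):
        grad = 2.0 * cov @ w                      # gradient of w' Cov w
        w = project_simplex(w - step_size * grad)
    return w

x = np.array([[0.2, -0.1, 0.5, -0.2],
              [0.0, -0.9, 0.8, 0.2],
              [0.4, 0.5, -0.3, -0.01]])
w = min_variance_weights(x)
print(w, w @ np.cov(x, bias=True) @ w)            # simplex weights and their variance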