Get LR from cyclical learning rate in PyTorch - python

I'm trying to implement the cyclical learning rate approach on top of the PyTorch reimplementation of StyleGAN by rosinality. To do so, I am building on what is suggested in this blog post.
To check how the loss changes as a function of the learning rate, I need to plot the (LR, loss) pairs. Here you can find my modified version of train.py. These are the main changes I made:
Define cyclical_lr, a function regulating the cyclical learning rate
import math

def cyclical_lr(stepsize, min_lr, max_lr):
    # Scaler: we can adapt this if we do not want the triangular CLR
    scaler = lambda x: 1.

    # Lambda function to calculate the LR
    lr_lambda = lambda it: min_lr + (max_lr - min_lr) * relative(it, stepsize)

    # Additional function to see where on the cycle we are
    def relative(it, stepsize):
        cycle = math.floor(1 + it / (2 * stepsize))
        x = abs(it / stepsize - 2 * cycle + 1)
        return max(0, (1 - x)) * scaler(cycle)

    return lr_lambda
Implement the cyclical learning rate for both the discriminator and the generator
step_size = 5*256
end_lr = 10**-1
factor = 10**5
clr = cyclical_lr(step_size, min_lr=end_lr / factor, max_lr=end_lr)
scheduler_g = torch.optim.lr_scheduler.LambdaLR(g_optimizer, [clr, clr])
d_optimizer = optim.Adam(discriminator.parameters(), lr=args.lr, betas=(0.0, 0.99))
scheduler_d = torch.optim.lr_scheduler.LambdaLR(d_optimizer, [clr])
Do you have suggestions on how to plot how the loss changes as a function of the learning rate? Ideally, I would like to do it using TensorBoard, where for now I am plotting the generator loss, the discriminator loss and the size of the generated images as a function of the iteration number:
if (i + 1) % 100 == 0:
    writer.add_scalar('Loss/G', gen_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Loss/D', disc_loss_val, i * args.batch.get(resolution))
    writer.add_scalar('Step/pixel_size', (4 * 2 ** step), i * args.batch.get(resolution))
    print(args.batch.get(resolution))
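One way to get the (LR, loss) relationship into TensorBoard is to log the learning rate that the scheduler has set on each optimizer as its own scalar, against the same step axis as the losses. This is only a sketch: it assumes the g_optimizer, d_optimizer and writer names from the snippets above, and that scheduler_g.step() and scheduler_d.step() are called once per iteration.

if (i + 1) % 100 == 0:
    step_idx = i * args.batch.get(resolution)
    # read back the LR the scheduler currently applies to each optimizer
    writer.add_scalar('LR/G', g_optimizer.param_groups[0]['lr'], step_idx)
    writer.add_scalar('LR/D', d_optimizer.param_groups[0]['lr'], step_idx)

With the losses and the learning rates logged against the same step axis, the loss-versus-LR relationship can be read off by comparing the two curves, or by exporting the scalars and plotting one against the other.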

Related

Loss function increasing instead of decreasing

I have been trying to make my own neural network from scratch. After some time I made it, but I ran into a problem I cannot solve. I have been following a tutorial which shows how to do this. The problem I ran into is how my network updates weights and biases. I know that gradient descent won't always decrease the loss, and for a few epochs it might even increase a bit, but it should still decrease overall and work much better than mine does. Sometimes the whole process gets stuck at a loss of 9 or 13 and cannot get out of it. I have checked many tutorials, videos and websites, but I couldn't find anything wrong in my code.
self.activate, self.dactivate, self.loss and self.dloss:
# sigmoid
self.activate = lambda x: np.divide(1, 1 + np.exp(-x))
self.dactivate = lambda x: np.multiply(self.activate(x), (1 - self.activate(x)))
# relu
self.activate = lambda x: np.where(x > 0, x, 0)
self.dactivate = lambda x: np.where(x > 0, 1, 0)
# loss I use (cross-entropy)
clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10) # it's used to squeeze x into a probability between 0 and 1 (which I think is required)
self.loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x))))/y.shape[0])
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
The code I use for forwardpropagation:
self.activate(np.dot(X, self.weights) + self.biases) # it's an example for first hidden layer
And that's the code for backpropagation:
First part, in DenseNeuralNetwork class:
last_derivative = self.dloss(output, y)

for layer in reversed(self.layers):
    last_derivative = layer.backward(last_derivative, self.lr)
And the second part, in Dense class:
def backward(self, last_derivative, lr):
    w = self.weights

    dfunction = self.dactivate(last_derivative)
    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)

    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)

    return np.dot(dfunction, w.T)
I have also made a repl so you can check the whole code and run it without any problems.
1.
line 12
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
if you're going to clip x, you should clip y too.
There are other ways to implement this, but if you are going to use this approach, change it to
self.dloss = lambda x, y: -(np.divide(clip(y), clip(x)) - np.divide(1 - clip(y), 1 - clip(x)))
2.
line 75
dfunction = self.dactivate(last_derivative)
this back propagation part is just wrong.
change to
dfunction = last_derivative*self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)
3.
line 77
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
last_derivative should be dfunction. I think this is just a mistake.
change to
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)
4.
line 85
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
Not sure where you are going with this, but I think the initialized values are too big. We're not doing precise hyperparameter tuning here, so I just made them smaller.
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
All good now
After this I changed the learning rate to 0.01 because it was too slow, and it worked fine.
I think you are misunderstanding backpropagation. You should probably double-check how it works. The other parts are OK, I think.
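For reference, here is roughly what the backward method from the question looks like with fixes 2 and 3 applied. This is only a sketch that keeps the question's attribute names (self.layer_input, self.weights, self.biases, self.dactivate); it is a method of the Dense class, not a standalone script.

def backward(self, last_derivative, lr):
    w = self.weights

    # fix 2: the activation derivative is taken at the layer's pre-activation
    # input and multiplied elementwise by the incoming derivative
    dfunction = last_derivative * self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)

    d_w = np.dot(self.layer_input.T, dfunction) * (1. / self.layer_input.shape[1])
    # fix 3: the bias gradient is built from dfunction, not last_derivative
    d_b = (1. / self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)

    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)

    return np.dot(dfunction, w.T)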
This can also be caused by your training data: either it is too small or it has too many diverse labels (that is what I gather from the code at the link you shared).
I re-ran your code several times and it produced different training performance. Sometimes the loss keeps decreasing until the last epoch, sometimes it keeps increasing, and in one run it decreased to some point and then started increasing again (with a minimum loss of 0.5).
I think it is your training data that matters this time. The learning rate is good enough, though (assuming you did the calculations for the linear combination, backpropagation, etc. correctly).

Multivariate Linear Regression Cost Too High

I was working on price prediction with the data set provided in this link, the imports-85.data.
With horsepower, curb-weight, engine-size and highway-mpg, I tried to normalize (due to the high cost) and run the gradient descent algorithm by implementing the following:
Initialization
data = df[attrs]
m = len(data) # m-training examples
f = len(attrs) # n-features
X = np.hstack((np.ones(shape=(m,1)),np.array(data)))
T = np.zeros(f + 1) # Coefficients of x(0),x(1),...x(n)
norm_price = df.price / 1000
Y = np.array(norm_price)
# Normalization
data['curb-weight'] = (data['curb-weight'] * 0.453592) / 1000 # To kg (e-1000)
data['highway-mpg'] = data['highway-mpg'] * 0.425144 # To km per litre (kml)
data['engine-size'] = data['engine-size'] / 100 # To e-100
data['horsepower'] = data['horsepower'] / 100 # To e-100
col_rename = {
    'curb-weight': 'curb-weight-kg(e-1000)',
    'highway-mpg': 'highway-kml',
    'engine-size': 'engine-size(e-100)',
    'horsepower': 'horsepower(e-100)'
}
data.rename(columns=col_rename, inplace=True)
Cost calculation
def calculateCost():
    global m, T, X
    hypot = (X.dot(T) - Y).transpose().dot(X.dot(T) - Y)
    return hypot / (2 * m)
Gradient descent
def gradDescent(threshold, iter=10000, alpha=3e-8):
    global T, X, Y, m
    i = 0
    cost = calculateCost()
    cost_hist = [cost]
    while i < iter:
        T = T - (alpha / m) * X.transpose().dot(X.dot(T) - Y)
        cost = calculateCost()
        cost_hist.append(cost)
        i += 1
        if cost <= threshold:
            return cost_hist
I ran the gradient descent with this implementation:
Batch Gradient Descent
Without normalization, the cost would be 118634960.460199.
With normalization, the cost would be 118.634960460199.
As a result, I have a few questions:
Is my normalization technique correct?
After normalization, the cost would be different. How do I set the threshold for the cost after normalization?
I think you may be misunderstanding 'normalization' in the context of machine learning. From my reading of your code, your 'normalization' section is doing unit conversions. Prior to gradient descent it is common to apply min-max scaling or standard scaling; see the scikit-learn user guide. These techniques put the features on a consistent scale range, so that changes in a single feature do not completely dominate the loss function. This question and this blog post have a longer discussion.
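For instance, a standard scaling of the four features with scikit-learn could look like this. This is only a sketch: it assumes df is the DataFrame from the question and that attrs holds the four feature names.

import numpy as np
from sklearn.preprocessing import StandardScaler

attrs = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']  # the four features from the question

scaler = StandardScaler()
scaled = scaler.fit_transform(df[attrs])        # zero mean, unit variance per feature

m = len(scaled)
X = np.hstack((np.ones(shape=(m, 1)), scaled))  # add the intercept column as before

With features on comparable scales, gradient descent typically tolerates a much larger learning rate than 3e-8, and the cost threshold no longer depends on hand-picked unit conversions.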

Keras loss function understanding

In order to understand some callbacks of Keras better, I want to artificially create a nan loss.
This is the function
def soft_dice_loss(y_true, y_pred):
    from keras import backend as K
    if K.eval(K.random_normal((1, 1), mean=2, stddev=2))[0][0] // 1 == 2.0:
        # return nan
        return K.exp(1.0) / K.exp(-10000000000.0) - K.exp(1.0) / K.exp(-10000000000.0)
    epsilon = 1e-6
    axes = tuple(range(1, len(y_pred.shape) - 1))
    numerator = 2. * K.sum(y_pred * y_true, axes)
    denominator = K.sum(K.square(y_pred) + K.square(y_true), axes)
    return 1 - K.mean(numerator / (denominator + epsilon))
So normally it calculates the dice loss, but from time to time it should randomly return a nan. However, this does not seem to happen.
From time to time, though, when I try to run the code, it stops right at the start (before the first epoch) with an error saying: An operation has None for gradient. Please make sure that all of your ops have a gradient defined
Does that mean that the random function of Keras is evaluated just once and then always returns the same value?
If so, why is that, and how can I create a loss function that returns nan from time to time?
Your first conditional statement is only evaluated once, when the loss function is defined (i.e. called; that is why Keras stops right at the start). Instead, you could use keras.backend.switch to integrate your conditional into the graph's logic. Your loss function could be something along the lines of:
import keras.backend as K
import numpy as np
def soft_dice_loss(y_true, y_pred):
    epsilon = 1e-6
    axes = tuple(range(1, len(y_pred.shape) - 1))
    numerator = 2. * K.sum(y_pred * y_true, axes)
    denominator = K.sum(K.square(y_pred) + K.square(y_true), axes)
    loss = 1 - K.mean(numerator / (denominator + epsilon))
    return K.switch(condition=K.random_normal((), mean=0, stddev=1) > 3,
                    then_expression=K.variable(np.nan),
                    else_expression=loss)
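This loss can then be passed to the model as usual; a minimal usage sketch, assuming a model has already been built:

model.compile(optimizer='adam', loss=soft_dice_loss)

Since a standard normal sample exceeds 3 only about 0.1% of the time, the nan branch fires rarely; lowering the threshold (e.g. > 1) makes it trigger more often.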

How is Nesterov's Accelerated Gradient Descent implemented in Tensorflow?

The documentation for tf.train.MomentumOptimizer offers a use_nesterov parameter to utilise Nesterov's Accelerated Gradient (NAG) method.
However, NAG requires the gradient at a location other than that of the current variable to be calculated, and the apply_gradients interface only allows for the current gradient to be passed. So I don't quite understand how the NAG algorithm could be implemented with this interface.
The documentation says the following about the implementation:
use_nesterov: If True use Nesterov Momentum. See Sutskever et al., 2013. This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper.
Having read through the paper in the link, I'm a little unsure about whether this description answers my question or not. How can the NAG algorithm be implemented when the interface doesn't require a gradient function to be provided?
TL;DR
TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.
Details
This is a great question. In the paper, the NAG update is defined as
v_{t+1} = μ.v_t - λ.∇f(θ_t + μ.v_t)
θ_{t+1} = θ_t + v_{t+1}
where f is our cost function, θ_t our parameters at time t, μ the momentum and λ the learning rate; v_t is NAG's internal accumulator.
The main difference with standard momentum is that the gradient is taken at θ_t + μ.v_t, not at θ_t. But as you said, TensorFlow only uses the gradient at θ_t. So what is the trick?
Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm tracks θ_t + μ.v_t, not θ_t. The other part comes from an approximation that is valid for high values of momentum.
Let's make a slight change of notation from the paper for the accumulator, to stick with TensorFlow's definitions. Define a_t = v_t / λ. The update rules then become
a_{t+1} = μ.a_t - ∇f(θ_t + μ.λ.a_t)
θ_{t+1} = θ_t + λ.a_{t+1}
(The motivation for this change in TF is that a is now a pure gradient momentum, independent of the learning rate. This makes the update process robust to changes in λ, a possibility that is common in practice but that the paper does not consider.)
If we write ψ_t = θ_t + μ.λ.a_t, then
a_{t+1} = μ.a_t - ∇f(ψ_t)
ψ_{t+1} = θ_{t+1} + μ.λ.a_{t+1}
        = θ_t + λ.a_{t+1} + μ.λ.a_{t+1}
        = ψ_t + λ.a_{t+1} + μ.λ.(a_{t+1} - a_t)
        = ψ_t + λ.a_{t+1} + μ.λ.[(μ - 1).a_t - ∇f(ψ_t)]
        ≈ ψ_t + λ.a_{t+1}
This last approximation holds for strong values of momentum, where μ is close to 1, so that μ - 1 is close to zero and ∇f(ψ_t) is small compared to a (this last point is actually more debatable, and less valid for directions where the gradient frequently changes sign).
We now have an update that uses the gradient at the current position, and the rules are pretty simple: they are in fact those of standard momentum.
However, we want θ_t, not ψ_t. This is why μ.λ.a_{t+1} is subtracted from ψ_{t+1} just before it is returned, and added back again first thing at the next call to recover ψ.
I couldn't see any info on this online, and the linked paper certainly wasn't helpful, so I had a look at the unit tests for tf.train.MomentumOptimizer, from which I can see tests for the implementation of both classic momentum and NAG modes.
Summary
var = var + accum * learning_rate * momentum
accum = accum * momentum + g
var = var - learning_rate * accum
var = var - accum * learning_rate * momentum
where accum starts at 0 and is updated at every step. The above is a modified version of the formulation in the unit test, and I find it a bit confusing. Here is the same set of equations arranged with my interpretation of what each of the parameters represent (I could be wrong though):
average_grad_0 = accum # previous rolling average
average_grad_1 = accum * momentum + g # updated rolling average
grad_diff = average_grad_1 - average_grad_0
adjustment = -learning_rate * (grad_diff * momentum + average_grad_1)
var += adjustment
accum = average_grad_1
In other words, it seems to me like TensorFlow's implementation attempts to guess the "adjusted gradient" in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient. I'd love to see a proof for this!
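As a quick sanity check that this rearrangement agrees with the summary equations above, the two forms can be compared numerically; a sketch with arbitrary values:

learning_rate, momentum, g = 2.0, 0.9, 0.1
var, accum = 1.0, 0.3

# summary form
v1 = var + accum * learning_rate * momentum
accum_new = accum * momentum + g
v1 = v1 - learning_rate * accum_new
v1 = v1 - accum_new * learning_rate * momentum

# rearranged "interpretation" form
average_grad_0 = accum
average_grad_1 = accum * momentum + g
grad_diff = average_grad_1 - average_grad_0
v2 = var - learning_rate * (grad_diff * momentum + average_grad_1)

print(v1, v2)  # the two values agree (up to floating-point rounding)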
What follows is more detail on how the classic and nesterov modes are implemented in tensorflow as per the tests.
Classic Momentum mode
For use_nesterov=False, based on the doTestBasic function, we have the following initial parameters:
learning_rate = 2.0
momentum = 0.9
var_0 = 1.0 # at time 0
grad = 0.1
Actually, the above are just the first element of the grads_0 and vars_0 arrays, but I'll just focus on a single value. For the subsequent timesteps, we have
var_1 = 1.0 - (0.1 * 2.0)
var_2 = 1.0 - (0.1 * 2.0) - ((0.9 * 0.1 + 0.1) * 2.0)
which I'm going to interpret as meaning:
var_1 = var_0 - (grad * learning_rate)
var_2 = var_1 - ((momentum * grad + grad) * learning_rate)
If we assume that for the purposes of the unit tests grad_0 == grad_1 == grad then this makes sense as a formulation of classic momentum.
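As a quick check, the same numbers fall out of a few lines of plain Python (assuming, as the unit test does, that the gradient stays constant across steps):

learning_rate = 2.0
momentum = 0.9
grad = 0.1

var, accum = 1.0, 0.0
for _ in range(2):
    accum = accum * momentum + grad        # rolling momentum accumulator
    var = var - learning_rate * accum      # classic momentum update

print(var)  # ≈ 0.42, i.e. 1.0 - (0.1 * 2.0) - ((0.9 * 0.1 + 0.1) * 2.0)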
Nesterov's Accelerated Gradient (NAG) mode
For use_nesterov=True, I had a look at the _update_nesterov_momentum_numpy function and the testNesterovMomentum test case.
The _update_nesterov_momentum_numpy function has the following definition:
def _update_nesterov_momentum_numpy(self, var, accum, g, lr, momentum):
    var = var + accum * lr * momentum
    accum = accum * momentum + g
    var = var - lr * accum
    var = var - accum * lr * momentum
    return var, accum
and it is called in the unit tests like this:
for t in range(1, 5):
    opt_op.run()
    var0_np, accum0_np = self._update_nesterov_momentum_numpy(
        var0_np, accum0_np, var0_np * 10, 2.0, 0.9)

Can I use fitted ML model as a part of a function in scipy.optimize.minimize?

Can I minimize this function using scipy.optimize?
def obj(x):
    Budget = ((df['CPP TA 30'] / 30 * df['TVC']) * x).sum()
    x = (x - min_train_x) / (max_train_x - min_train_x)
    x = np.array([x])
    return (0.05 * model.predict(x) - (1.7 * (Budget / 10**10)))[0][0]
x0 = np.random.uniform(size = 23)
x0 = (x0 / np.sum(x0)) * 1800
from scipy.optimize import minimize
res = minimize(obj, x0)
Definitely you can. Why not try?
If your model is smooth (e.g. linear or a neural network), the optimization will converge nicely. The only problem is that with nonlinear models there may be multiple local optima, so it is safer to try several different starting points or to initialize with some heuristic.
If the model is not smooth (e.g. an ensemble of trees), there may be problems with convergence, and you should use a gradient-free algorithm like Nelder-Mead, as in the sketch below.
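For example, switching to a gradient-free method is just a matter of passing method='Nelder-Mead'; a sketch reusing obj and x0 from the question (the iteration cap is an arbitrary choice):

from scipy.optimize import minimize

# Nelder-Mead does not rely on gradients, so it copes better with the
# piecewise-constant predictions of tree ensembles
res = minimize(obj, x0, method='Nelder-Mead', options={'maxiter': 20000})
print(res.fun, res.x)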
