The initial Gradient in Gradient Descent is abysmal and wrong

I am building an input optimiser in Python (IDLE 3.9). It adjusts an input vector "thetas" so that the output calculated by running a model with that input in a solver matches a target output "targetRes". The model is defined in a function called FEA(thetas, fem), which runs the solver and returns the output.
The chosen optimisation algorithm is Gradient Descent. The FEA output is taken as the hypothesis function, the target output is subtracted from it, and the result is squared and halved to give the loss function. The gradient of the loss function is then estimated using a finite difference method (FDM), after which the update step takes place. For now, I am only running the algorithm on thetas[0]. Below is the GD code:
targetRes = -0.1
thetas = [1000., 1., 1., 1., 1.]  # input here initial value of unknown vector theta

def LF(thetas):
    return (FEA(thetas, fem) - targetRes) ** 2 / 2

def FDM(thetas, LF):
    fdm = []
    for i, theta in enumerate(thetas):
        h = 0.1
        if i == 0:
            print(h)
        thetas_p_h = []
        for t in thetas:
            thetas_p_h.append(t)
        thetas_p_h[i] += h
        thetas_m_h = []
        for t in thetas:
            thetas_m_h.append(t)
        thetas_m_h[i] -= h
        grad = (LF(thetas_p_h) - LF(thetas_m_h)) / (2 * h)
        fdm.append(grad)
    return fdm

def GD(thetas, LF):
    tol = 0.000001
    alpha = 10000000
    Nmax = 1000
    for n in range(Nmax):
        gradient = FDM(thetas, LF)
        thetas_new = []
        for gradient_item, theta_item in zip(gradient, thetas):
            t_new = theta_item - (alpha * gradient_item)
            thetas_new.append(t_new)
        print(thetas, f'gradient = {gradient}', LF(thetas), thetas_new, LF(thetas_new))
        if tol >= abs(LF(thetas_new) - LF(thetas)):
            print(f"solution converged in {n} iterations, theta = {thetas_new}, LF(theta) = {LF(thetas_new)}, FEA(theta) = {FEA(thetas_new, fem)}")
            return thetas_new
        thetas = thetas_new
    else:
        print(f"reached max iterations, Nmax = {Nmax}")
        return None

GD(thetas, LF)
As you can see the gradient descent algorithm I am using is different to the linear regression type, and that is because there are no features to evaluate, just labels (y). Unfortunately I am not allowed to provide the solver code and most likely not allowed to provide the model code as well, defined in FEA.
In the current example, the calculated initial output is -0.070309. My issues are:
1. The gradient is tiny: 2.078369999999961e-06 at the first update iteration and 1.834250000000102e-08 at the final one. The first value is most likely wrong as well, since calculating it by hand gave me roughly 0.000005, which is still tiny.
2. I am using a gigantic learning rate, because the algorithm is horribly slow without it.
3. The algorithm converges to the target output only for certain input values. In another case, with other values of thetas, the loss function was of order 10^7 at the zeroth iteration and dropped right down to zero at the first iteration, which is unrealistic. In both cases I use a large learning rate, and both converge to the target output. With some other inputs the algorithm does not converge to the target output, and others lead to a negative value of thetas[0], which causes the solver to fail.
Please also disregard the fact that I am not using any external libraries; it is intended. So, without trying to run the code, are there any observations? Does anyone see anything obvious that I'm missing here? Could it be due to the orders of magnitude of the inputs? What are your thoughts? (There are no issues with the solver or the model function FEA; both have been repeatedly verified to work perfectly fine.)
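Since FEA cannot be shared, one way to narrow the problem down is to run the same central-difference scheme on a stand-in loss whose gradient is known exactly. The sketch below is only an illustration: toy_loss is a hypothetical replacement for LF, and central_difference mirrors the structure of FDM above rather than reproducing it.

```python
def central_difference(f, x, h):
    """Central finite-difference gradient of f at the list x, one entry per component."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def toy_loss(x):
    """Hypothetical stand-in for LF with a known gradient: [2*(x0 - 3), 4*x1]."""
    return (x[0] - 3.0) ** 2 + 2.0 * x[1] ** 2


x0 = [1000.0, 1.0]
numeric = central_difference(toy_loss, x0, h=0.1)
exact = [2.0 * (x0[0] - 3.0), 4.0 * x0[1]]
print(numeric, exact)
```

If the two lists agree here, the finite-difference code itself is probably sound, and a tiny gradient more likely reflects how weakly the real loss responds to thetas[0] at that scale.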

Related

Gradient descent function always returns a parameter vector with Nan values

I am trying to apply a simple optimization using gradient descent. In particular, I want to calculate the vector of parameters (Theta) that minimizes the cost function (Mean Squared Error).
The gradient descent function looks like this:
eta = 0.1  # learning rate
n_iterations = 1000
m = 100
theta = np.random.randn(2,1)  # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # partial derivative of the cost function
    theta = theta - eta * gradients
Where X_b and y are respectively the input matrix and the target vector.
Now, if I take a look at my final theta, it is always equal to [[nan], [nan]], while it should be equal to [[85.4575313], [0.11802224]] (obtained by using both np.linalg and ScikitLearn LinearRegression).
In order to get a numeric result, I have to reduce the learning rate to 0.00001 and the number of iterations to 500. After applying these changes, the results are far from the real theta.
My data, both X_b and y, are scaled using a StandardScaler.
If I try to print out theta at each iteration, I get the following (these are only few results):
...
[[2.09755838e+297]
[7.26731496e+299]]
[[-3.54990719e+300]
[-1.22992017e+303]]
[[6.00786188e+303]
[ inf]]
[[-inf]
[ nan]]
...
How can I solve the problem? Is it because of the function's domain?
Thanks

I've found an error in the code. For the benefit of all readers: the error came from the feature scaling part, which isn't shown in the code above.
The initial theta (randomly assigned) had a completely different scale compared to the dataset, and this made it impossible to find valid parameters for the regression.
So by using the correctly scaled inputs and targets, the function does its job and converges to the values that I know are correct, as reported in my question.
As Kuedsha suggested, I tried to apply a learning schedule in order to reduce the learning rate at each iteration, even though it is not necessary in this specific case. It works, but of course it takes more iterations to converge. I think this could be a useful thing to do in a stochastic gradient descent algorithm.
Thanks for your support

In my personal experience, this is probably due to the learning rate you are using. If your result goes to infinity, this might be because your learning rate is too big. Also, be sure to decrease the learning rate (eta in your code) in each iteration, as this helps your solution converge. I am not sure what the optimal way to do it would be for your particular problem, but you could try something like:
eta=initial_eta/(iteration+1)
or
eta=initial_eta/sqrt(iteration+1)
Edit: in fact, as you can see in your results, the value of your parameter flips sign in each iteration while always increasing in modulus.
I think this is because, when you calculate the gradient in the first iteration, eta*gradient is so large that the update overshoots to a negative value which is larger in modulus. Then, in the second iteration, the gradient is even greater and eta*gradient is therefore also greater, which gives you a positive number that is again greater in modulus. This continues until you get infinity.
This is the reason why you normally have to be careful when tuning the value of the learning rate and decrease it with the iterations.
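The overshooting described above can be reproduced on a one-dimensional toy cost. This is only a sketch under assumptions: the cost (t - 3)**2, the starting point and initial_eta = 1.1 are made up for illustration and have nothing to do with the asker's data.

```python
def grad(t):
    """Gradient of the toy cost (t - 3)**2, whose minimum is at t = 3."""
    return 2.0 * (t - 3.0)


def descend(initial_eta, n_iterations, decay):
    t = 0.0
    for iteration in range(n_iterations):
        # With decay, eta shrinks as initial_eta / (iteration + 1).
        eta = initial_eta / (iteration + 1) if decay else initial_eta
        t = t - eta * grad(t)
    return t


fixed = descend(initial_eta=1.1, n_iterations=50, decay=False)
decayed = descend(initial_eta=1.1, n_iterations=1000, decay=True)
print(fixed, decayed)
```

With the fixed rate, each step multiplies the distance to the minimum by -1.2, so the iterate flips sides and grows; with the decaying rate, the same starting value settles near 3.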

Checking Neural Network Gradient with Finite Difference Methods Doesn't Work

After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed my cost function produces correct costs for regularized inputs and not. Here's the cost function:
def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size+1)):], (num_labels, (hidden_layer_size + 1)))
    if lambda_ is None:
        lambda_ = 0
    # grab number of observations
    m = X.shape[0]
    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()
    ones = np.ones((m, 1))
    X = np.hstack((ones, X))
    # layer 1
    a1 = X
    z2 = Theta1@a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2@a2.T
    # layer 3
    a3 = sigmoid(z3)
    reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2,Theta2))) - np.subtract((Theta1[:,0].T@Theta1[:,0]),(Theta2[:,0].T@Theta2[:,0])))
    cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
    d2 = Theta2[:,1:].T@d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1.
    Delta1 = d2@a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2.
    Delta2 = d3@a2
    Delta2 /= m
    reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0],1)), Theta1[:,1:], axis=1)
    reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0],1)), Theta2[:,1:], axis=1)
    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2
    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
    return cost, grad
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Input: Regularization parameter, lambda, as int or float.
    Output: Analytical gradients produced by backprop code and the numerical gradients (computed
    using computeNumericalGradient). These two gradient computations should result in
    very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)
    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())
    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)
    # examine the two gradient computations; two columns should be very similar.
    print('The columns below should be very similar.\n')
    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))
    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient.m, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad-grad)/np.linalg.norm(numgrad+grad)
    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n' \
          'the relative difference will be small (less than 1e-9). \n' \
          '\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in
    incoming connections and fan_out outgoing connections using a fixed
    strategy.
    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
    of incoming connections for the same layer as int.
    Output: Weight matrix, W, of size(1 + fan_in, fan_out), as the first row of W handles the "bias" terms
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin", this ensures that the values in W are of similar scale;
    # this will be useful for debugging
    W = np.sin(range(1, np.size(W)+1)) / 10
    return W.reshape(fan_out, fan_in+1)

def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences"
    and provides a numerical estimate of the gradient (i.e.,
    gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Inputs: Cost, J, as computed by nnCost function; Parameter vector, theta.
    Output: Gradient vector using finite differences. Per Dr. Ng,
    'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
    J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
    be the (approximately) the partial derivative of J with respect
    to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
        cost1, grad1 = J((nn_params - perturb))
        cost2, grad2 = J((nn_params + perturb))
        # record the difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2*e)
        perturb[i] = 0
    return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to move together changing that option.

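One thought: I think your perturbation is a little large, being 1e-4. For double precision floating point numbers, a one-sided difference should use something more like 1e-8, i.e., the root of the machine precision; for the central difference you are using, the usual rule of thumb is its cube root, roughly 1e-5 (or are you working with single precision?!).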
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating point computations in numpy are not deterministic, as you seem to have found out. The noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
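The trade-off between truncation error (step too large) and round-off error (step too small) can be seen in a small experiment. The sketch below uses math.sin as a stand-in function with a known derivative; the function and the list of step sizes are arbitrary choices, not taken from the question.

```python
import math


def forward_difference(f, x, h):
    """One-sided estimate of f'(x); truncation error shrinks like h."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    """Two-sided estimate of f'(x); truncation error shrinks like h**2."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
exact = math.cos(x)  # exact derivative of sin at x
for h in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    err_fwd = abs(forward_difference(math.sin, x, h) - exact)
    err_ctr = abs(central_difference(math.sin, x, h) - exact)
    print(f"h={h:.0e}  forward error={err_fwd:.2e}  central error={err_ctr:.2e}")
```

Printing the errors side by side shows where each scheme stops improving as h shrinks, which is a quick way to pick a step before blaming the analytical gradient.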

All of the following figured into the solution to my problem. For those attempting to translate MATLAB code to Python, whether from Andrew Ng's Coursera Machine Learning course or not, these are things everyone should know.
MATLAB does everything in FORTRAN (column-major) order; NumPy defaults to C (row-major) order. This affects how vectors are populated and, thus, your results. You should always be in FORTRAN order if you want your answers to match what you did in MATLAB. See docs
Getting your vectors in FORTRAN order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). You may, however, achieve the same thing with .ravel() by transposing the array first and then applying .ravel(), like so: X.T.ravel().
Speaking of .ravel(), the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this Q&A StackOverflow and docs on .ravel() which link to .flatten(). More Docs
If you're translating your code from a MATLAB live script into a Jupyter notebook or Google Colab, you must police your namespace. On one occasion, I found that the variable I thought was being passed was not actually the variable that was being passed. Why? Jupyter and Colab notebooks have a lot of global variables that one would never write ordinarily.
There is a better function to evaluate the differences between numerical and analytical gradients: Relative Error Comparison np.abs(numerical-analytical)/(numerical+analytical). Read about it here CS231 Also, consider the accepted post above.
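As a rough illustration of the relative error comparison, here is a dependency-free sketch. Note the assumptions: it uses absolute values in the denominator and a small floor (the floor parameter is hypothetical) so that entries with opposite signs or exact zeros do not cause a division by zero, which differs slightly from the formula as written above. The sample gradients are made up.

```python
def relative_errors(numerical, analytical, floor=1e-12):
    """Element-wise |n - a| / max(|n| + |a|, floor) for two equal-length sequences."""
    return [
        abs(n - a) / max(abs(n) + abs(a), floor)
        for n, a in zip(numerical, analytical)
    ]


# Made-up gradient entries, purely for illustration.
numgrad = [0.0123456, -0.5, 2.0e-7]
grad = [0.0123457, -0.5, 2.0e-7]
errors = relative_errors(numgrad, grad)
print(max(errors))
```

Looking at the largest element-wise error, rather than a single norm ratio, makes it easier to see which parameter's gradient disagrees.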

Implementing a trainable generalized Bump function layer in Keras/Tensorflow

I'm trying to code the following variant of the Bump function, applied component-wise:
f_σ(x) = exp(-σ / (σ - x²)) if |x| < σ, and 0 otherwise,
where σ is trainable; but it's not working (errors reported below).
My attempt:
Here's what I've coded up so far (if it helps). Suppose I have two functions (for example):
def f_True(x):
    # Compute Bump Function
    bump_value = 1-tf.math.pow(x,2)
    bump_value = -tf.math.pow(bump_value,-1)
    bump_value = tf.math.exp(bump_value)
    return(bump_value)

def f_False(x):
    # Compute Bump Function
    x_out = 0*x
    return(x_out)

class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.threshold_level = self.add_weight(name='threshlevel',
                                               shape=[1],
                                               initializer='GlorotUniform',
                                               trainable=True)

    def call(self, input):
        # Determine Thresholding Logic
        The_Logic = tf.math.less(input,self.threshold_level)
        # Apply Logic
        output_step_3 = tf.cond(The_Logic,
                                lambda: f_True(input),
                                lambda: f_False(input))
        return output_step_3
Error Report:
Train on 100 samples
Epoch 1/10
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
32/100 [========>.....................] - ETA: 3s
...
tensorflow:Gradients do not exist for variables
Moreover, it does not seem to be applied component-wise (besides the non-trainable problem). What could be the problem?

Unfortunately, no operation to check whether x is within (-σ, σ) will be differentiable and therefore σ cannot be learnt via any gradient descent method. Specifically, it is not possible to compute the gradients with respect to self.threshold_level because tf.math.less is not differentiable with respect to the condition.
Regarding the element-wise conditional, you can instead use tf.where to select elements from f_True(input) or f_False(input) according to the component-wise boolean values of the condition. For example:
output_step_3 = tf.where(The_Logic, f_True(input), f_False(input))
NOTE: I answered based on the provided code, where self.threshold_level is not used in f_True nor f_False. If self.threshold_level is used in those functions as in the provided formula, the function will, of course, be differentiable with respect to self.threshold_level.
Updated 19/04/2020: Thank you @today for the clarification.

I suggest you try a normal distribution instead of a bump.
In my tests here, this bump function is not behaving well (I can't find a bug, but I can't rule one out either; my graph shows two very sharp bumps, which is not good for networks).
With a normal distribution, you would get a regular and differentiable bump whose height, width and center you can control.
So, you may try this function:
y = a * exp ( - b * (x - c)²)
Try it in some graph and see how it behaves.
For this:
import tensorflow as tf
from tensorflow.keras import backend as K

class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        #suggested shape (has a different kernel for each input feature/channel)
        shape = tuple(1 for _ in input_shape[:-1]) + input_shape[-1:]
        #for your desired shape of only 1:
        shape = tuple(1 for _ in input_shape) #all ones
        #height
        self.kernel_a = self.add_weight(name='kernel_a',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        #inverse width
        self.kernel_b = self.add_weight(name='kernel_b',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        #center
        self.kernel_c = self.add_weight(name='kernel_c',
                                        shape=shape,
                                        initializer='zeros',
                                        trainable=True)

    def call(self, input):
        exp_arg = - self.kernel_b * K.square(input - self.kernel_c)
        return self.kernel_a * K.exp(exp_arg)
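To get a feel for how the three parameters shape the curve before putting it in a layer, the formula can be evaluated in plain Python. This is just a scalar sketch of y = a * exp(-b * (x - c)²); gaussian_bump and the sample values are made up for illustration.

```python
import math


def gaussian_bump(x, a=1.0, b=1.0, c=0.0):
    """y = a * exp(-b * (x - c)**2): a is the height, b the inverse width, c the center."""
    return a * math.exp(-b * (x - c) ** 2)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, gaussian_bump(x), gaussian_bump(x, a=2.0, b=4.0, c=1.0))
```

The peak value equals a and sits at x = c, and a larger b makes the bump narrower, which matches the roles of kernel_a, kernel_b and kernel_c in the layer.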

I am a bit surprised that no one has mentioned the main (and only) reason for the given warning! As it seems, that code is supposed to implement the generalized variant of Bump function; however, just take a look at the functions implemented again:
def f_True(x):
    # Compute Bump Function
    bump_value = 1-tf.math.pow(x,2)
    bump_value = -tf.math.pow(bump_value,-1)
    bump_value = tf.math.exp(bump_value)
    return(bump_value)

def f_False(x):
    # Compute Bump Function
    x_out = 0*x
    return(x_out)
The error is evident: the trainable weight of the layer is not used anywhere in these functions! So it is no surprise that you get the message saying that no gradient exists for it: you are not using it at all, so there is no gradient to update it! Rather, this is exactly the original Bump function (i.e. with no trainable weight).
But, you might say that: "at least, I used the trainable weight in the condition of tf.cond, so there must be some gradients?!"; however, it's not like that and let me clear up the confusion:
First of all, as you have noticed as well, we are interested in element-wise conditioning. So instead of tf.cond you need to use tf.where.
The other misconception is to claim that since tf.less is used as the condition, and since it is not differentiable i.e. it has no gradient with respect to its inputs (which is true: there is no defined gradient for a function with boolean output w.r.t. its real-valued inputs!), then that results in the given warning!
That's simply wrong! The derivative here would be taken of the output of the layer w.r.t. the trainable weight, and the selection condition is NOT present in the output. Rather, it's just a boolean tensor which determines the output branch to be selected. That's it! The derivative of the condition is not taken and will never be needed. So that's not the reason for the given warning; the one and only reason is what I mentioned above: the trainable weight makes no contribution to the output of the layer. (Note: if the point about the condition is a bit surprising to you, then think about a simple example: the ReLU function, which is defined as relu(x) = 0 if x < 0 else x. If the derivative of the condition, i.e. x < 0, which does not exist, were needed, then we would not be able to use ReLU in our models and train them using gradient-based optimization methods at all!)
(Note: starting from here, I would refer to and denote the threshold value as sigma, like in the equation).
All right! We found the reason behind the error in implementation. Could we fix this? Of course! Here is the updated working implementation:
import tensorflow as tf
from tensorflow.keras.initializers import RandomUniform
from tensorflow.keras.constraints import NonNeg

class BumpLayer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(BumpLayer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.sigma = self.add_weight(
            name='sigma',
            shape=[1],
            initializer=RandomUniform(minval=0.0, maxval=0.1),
            trainable=True,
            constraint=tf.keras.constraints.NonNeg()
        )
        super().build(input_shape)

    def bump_function(self, x):
        return tf.math.exp(-self.sigma / (self.sigma - tf.math.pow(x, 2)))

    def call(self, inputs):
        greater = tf.math.greater(inputs, -self.sigma)
        less = tf.math.less(inputs, self.sigma)
        condition = tf.logical_and(greater, less)
        output = tf.where(
            condition,
            self.bump_function(inputs),
            0.0
        )
        return output
A few points regarding this implementation:
We have replaced tf.cond with tf.where in order to do element-wise conditioning.
Further, as you can see, unlike your implementation, which only checked one side of the inequality, we are using tf.math.less, tf.math.greater and also tf.logical_and to find out whether the input values have magnitudes less than sigma (alternatively, we could do this using just tf.math.abs and tf.math.less; no difference!). And let us repeat it: using boolean-output functions in this way does not cause any problems and has nothing to do with derivatives/gradients.
We are also using a non-negativity constraint on the sigma value learned by the layer. Why? Because sigma values less than zero do not make sense (i.e. the range (-sigma, sigma) is ill-defined when sigma is negative).
And considering the previous point, we take care to initialize the sigma value properly (i.e. to a small non-negative value).
And also, please don't do things like 0.0 * inputs! It's redundant (and a bit weird) and it is equivalent to 0.0; and both have a gradient of 0.0 (w.r.t. inputs). Multiplying zero with a tensor does not add anything or solve any existing issue, at least not in this case!
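The point that the weight has to appear in the selected branch, and not only in the condition, can be illustrated without TensorFlow. In the scalar sketch below, a central finite difference stands in for what automatic differentiation would report; original_bump, generalized_bump and d_dparam are hypothetical helper names, and the sample point is arbitrary.

```python
import math


def original_bump(x, threshold):
    """The question's version: threshold only selects the branch (one-sided test)."""
    return math.exp(-1.0 / (1.0 - x * x)) if x < threshold else 0.0


def generalized_bump(x, sigma):
    """The fixed version: sigma also appears inside the selected branch."""
    return math.exp(-sigma / (sigma - x * x)) if abs(x) < sigma else 0.0


def d_dparam(f, x, p, h=1e-6):
    """Central-difference estimate of the derivative of f(x, p) with respect to p."""
    return (f(x, p + h) - f(x, p - h)) / (2 * h)


x, p = 0.3, 0.5
print(d_dparam(original_bump, x, p))     # the parameter has no effect on the output
print(d_dparam(generalized_bump, x, p))  # the parameter changes the output
# Closed form of the second derivative, the same expression as dy in generate_data.
analytic = generalized_bump(x, p) * x * x / (p - x * x) ** 2
print(analytic)
```

Away from the threshold itself, nudging the parameter in the original version leaves the output unchanged, which is the scalar analogue of "Gradients do not exist".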
Now, let's test it to see how it works. We write some helper functions to generate training data based on a fixed sigma value, and also to create a model which contains a single BumpLayer with input shape of (1,). Let's see if it could learn the sigma value which is used for generating training data:
import numpy as np

def generate_data(sigma, min_x=-1, max_x=1, shape=(100000,1)):
    assert sigma >= 0, 'Sigma should be non-negative!'
    x = np.random.uniform(min_x, max_x, size=shape)
    xp2 = np.power(x, 2)
    condition = np.logical_and(x < sigma, x > -sigma)
    y = np.where(condition, np.exp(-sigma / (sigma - xp2)), 0.0)
    dy = np.where(condition, xp2 * y / np.power((sigma - xp2), 2), 0)
    return x, y, dy

def make_model(input_shape=(1,)):
    model = tf.keras.Sequential()
    model.add(BumpLayer(input_shape=input_shape))
    model.compile(loss='mse', optimizer='adam')
    return model

# Generate training data using a fixed sigma value.
sigma = 0.5
x, y, _ = generate_data(sigma=sigma, min_x=-0.1, max_x=0.1)
model = make_model()
# Store initial value of sigma, so that it could be compared after training.
sigma_before = model.layers[0].get_weights()[0][0]
model.fit(x, y, epochs=5)
print('Sigma before training:', sigma_before)
print('Sigma after training:', model.layers[0].get_weights()[0][0])
print('Sigma used for generating data:', sigma)
# Sigma before training: 0.08271004
# Sigma after training: 0.5000002
# Sigma used for generating data: 0.5
Yes, it could learn the value of sigma used for generating data! But, is it guaranteed that it actually works for all different values of training data and initialization of sigma? The answer is: NO! Actually, it is possible that you run the code above and get nan as the value of sigma after training, or inf as the loss value! So what's the problem? Why might these nan or inf values be produced? Let's discuss it below...
Dealing with numerical stability
One of the important things to consider, when building a machine learning model and using gradient-based optimization methods to train them, is the numerical stability of operations and calculations in a model. When extremely large or small values are generated by an operation or its gradient, almost certainly it would disrupt the training process (for example, that's one of the reasons behind normalizing image pixel values in CNNs to prevent this issue).
So, let's take a look at this generalized bump function (and let's set the thresholding aside for now). It's obvious that this function has singularities (i.e. points where either the function or its gradient is not defined) at x^2 = sigma (i.e. when x = sqrt(sigma) or x = -sqrt(sigma)). The animated diagram below shows the bump function (the solid red line), its derivative w.r.t. sigma (the dotted green line) and x=sigma and x=-sigma lines (two vertical dashed blue lines), when sigma starts from zero and is increased to 5:
As you can see, around the region of singularities the function is not well-behaved for all values of sigma, in the sense that both the function and its derivative take extremely large values at those regions. So given an input value at those regions for a particular value of sigma, exploding output and gradient values would be generated, hence the issue of inf loss value.
Even further, there is a problematic behavior of tf.where which causes the issue of nan values for the sigma variable in the layer: surprisingly, if the produced value in inactive branch of tf.where is extremely large or inf, which with the bump function results in extremely large or inf gradient values, then the gradient of tf.where would be nan, despite the fact that the inf is in inactive branch and is not even selected (see this Github issue which discusses exactly this)!!
So is there any workaround for this behavior of tf.where? Yes, actually there is a trick to work around this issue, which is explained in this answer: basically, we can use an additional tf.where in order to prevent the function from being applied to these regions. In other words, instead of applying self.bump_function to every input value, we filter out those values which are NOT in the range (-self.sigma, self.sigma) (i.e. the actual range over which the function should be applied) and instead feed the function with zero (which always produces a safe value, namely exp(-1)):
output = tf.where(
    condition,
    self.bump_function(tf.where(condition, inputs, 0.0)),
    0.0
)
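Why an unselected branch can still poison the gradient comes down to floating point arithmetic: a mask of 0.0 multiplied by inf gives nan, not 0.0. The scalar sketch below is a caricature of that mechanism under stated assumptions and not TensorFlow's actual implementation; bump_grad_wrt_sigma and masked_grad are hypothetical helpers, and the explicit math.inf stands in for the overflow a tensor library would produce at the singularity.

```python
import math


def bump_grad_wrt_sigma(x, sigma):
    """Hand-written d/dsigma of exp(-sigma / (sigma - x**2))."""
    denom = sigma - x * x
    if denom == 0.0:
        # Assumption: represent the singularity by inf instead of raising an error.
        return math.inf
    return math.exp(-sigma / denom) * x * x / denom ** 2


def masked_grad(x, sigma, use_safe_input):
    inside = abs(x) < sigma
    mask = 1.0 if inside else 0.0
    # The "double where" idea: feed 0.0 to the function whenever x is outside the range.
    x_fed = x if (inside or not use_safe_input) else 0.0
    return mask * bump_grad_wrt_sigma(x_fed, sigma)


x, sigma = 0.5, 0.25  # x lies outside (-sigma, sigma) and x**2 == sigma
print(masked_grad(x, sigma, use_safe_input=False))
print(masked_grad(x, sigma, use_safe_input=True))
```

With the safe input, the masked-out branch is evaluated at 0.0, where the gradient is finite, so the product with the zero mask stays a clean zero.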
Applying this fix would entirely resolve the issue of nan values for sigma. Let's evaluate it on training data values generated with different sigma values and see how it would perform:
true_learned_sigma = []
for s in np.arange(0.1, 10.0, 0.1):
    model = make_model()
    x, y, dy = generate_data(sigma=s, shape=(100000,1))
    model.fit(x, y, epochs=3 if s < 1 else (5 if s < 5 else 10), verbose=False)
    sigma = model.layers[0].get_weights()[0][0]
    true_learned_sigma.append([s, sigma])
    print(s, sigma)
# Check if the learned values of sigma
# are actually close to true values of sigma, for all the experiments.
res = np.array(true_learned_sigma)
print(np.allclose(res[:,0], res[:,1], atol=1e-2))
# True
It could learn all the sigma values correctly! That's nice. That workaround worked! Although, there is one caveat: this is guaranteed to work properly and learn any sigma value if the input values to this layer are greater than -1 and less than 1 (i.e. this is the default case of our generate_data function); otherwise, there is still the issue of inf loss value which might happen if the input values have a magnitude of greater than 1 (see point #1 and #2, below).
Here is some food for thought for the curious and interested mind:
1. It was just mentioned that if the input values to this layer are greater than 1 or less than -1, then it may cause problems. Can you argue why this is the case? (Hint: use the animated diagram above and consider cases where sigma > 1 and the input value is between sqrt(sigma) and sigma, or between -sigma and -sqrt(sigma).)
2. Can you provide a fix for the issue in point #1, i.e. such that the layer could work for all input values? (Hint: like the workaround for tf.where, think about how you can further filter out the unsafe values to which the bump function could be applied and produce exploding output/gradient.)
3. However, if you are not interested in fixing this issue, and would like to use this layer in a model as it is now, then how would you guarantee that the input values to this layer are always between -1 and 1? (Hint: as one solution, there is a commonly-used activation function which produces values exactly in this range and could potentially be used as the activation function of the layer before this one.)
4. If you take a look at the last code snippet, you will see that we have used epochs=3 if s < 1 else (5 if s < 5 else 10). Why is that? Why do large values of sigma need more epochs to be learned? (Hint: again, use the animated diagram and consider the derivative of the function for input values between -1 and 1 as the sigma value increases. What is their magnitude?)
5. Do we also need to check the generated training data for any nan, inf or extremely large values of y and filter them out? (Hint: yes, if sigma > 1 and the range of values, i.e. min_x and max_x, falls outside of (-1, 1); otherwise, no, that's not necessary! Why is that? Left as an exercise!)

gradient descent for linear regression in python code

import numpy as np
def computeCost(X, y, theta):
    # squared prediction error of every training example
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))
def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])  # flattens theta and counts its entries
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:,j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost
Here is the code for the linear regression cost function and gradient descent that I found in a tutorial, but I am not quite sure how it works.
First, I get how the computeCost code works, since it's just 1/(2M) times the sum of squared errors, where M is the number of data points.
For the gradientDescent code, I just don't understand how it works in general. I know the formula for updating theta is something like
theta = theta - (learningRate) * (derivative of the cost function J). But I am not sure where (alpha / len(X)) * np.sum(term) comes from on the line updating temp[0, j].
Please help me understand!

I'll break this down for you. Your gradientDescent function takes in the predictor variables (X), the target variable (y), the weight matrix (theta), and two training parameters (alpha and iters). The function's job is to figure out how much each column of X should be multiplied by before the columns are added up to give the predicted values of y.
In the first line of the function you initialise a matrix called temp to zeros, with the same shape as theta. It holds the updated weights during each iteration before they are copied back into theta, which the function outputs in the end. The params variable is the number of weights, i.e. the number of predictor variables. This line makes more sense in the context of neural networks, where you create abstractions of features. In a typical feed-forward neural network we take in, let's say, 5 features and convert them to 4 or 6 or n features by using a weight matrix, and so on. In linear regression we simply distill all the input features into one output feature, so theta is essentially a row vector (which is evident from the line temp[0, j] = ...) and params is the number of features. The cost array just stores the cost at each iteration.
Now on to the two loops. The outer loop is where the model is trained: it runs iters times, and the weights are updated a little on each pass. In a neural network context we generally call these passes epochs. In the line err = (X * theta.T) - y under the outer for loop, we calculate the prediction error for each training example, so err has shape (number of examples, 1). The inner loop then runs once per weight.
In the line term = np.multiply(err, X[:,j]) we calculate the individual adjustment that should be made to weight j. We define the cost as the sum of (y_predicted - y_actual)**2 over the training points, divided by 2 * number_of_training_points, where y_predicted = X_1*w_1 + X_2*w_2 + X_3*w_3 + ... If we differentiate this cost with respect to a particular weight (say w_i), we get the sum of (y_predicted - y_actual) * X_i divided by number_of_training_points, where X_i is the column that w_i multiplies. So the term = line computes the per-example part of that derivative.
As you might notice, term is an array with one entry per training example, but that is taken care of in the next line. There we take the average of term (by summing it and dividing by len(X)), multiply it by the learning rate alpha, and subtract the result from the corresponding weight, storing the new value in temp. Once all the weights have been updated and stored in temp, we replace the original theta with temp. We repeat this process iters times.
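To make the origin of (alpha / len(X)) * np.sum(term) explicit, the derivation can be written out as follows. This is a sketch that assumes the cost is exactly what computeCost implements; the symbols m (for len(X)), x^{(i)} (row i of X) and x_j^{(i)} (its j-th entry) are notation introduced here, not names from the code.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta^{T} - y^{(i)} \right)^{2}

\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \theta^{T} - y^{(i)} \right) x_j^{(i)}

\theta_j \leftarrow \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
  \underbrace{\left( x^{(i)} \theta^{T} - y^{(i)} \right) x_j^{(i)}}_{\texttt{term}[i]}
```

The factor 2 from differentiating the square cancels the 1/2 in the cost, which leaves alpha / m times the sum of term.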

If, instead of writing it as ((alpha / len(X)) * np.sum(term)), you write it as (alpha * (np.sum(term) / len(X))), which you're allowed to do since the order of scalar multiplication and division doesn't matter here, then you are just multiplying alpha by the average of term, since term has len(X) entries anyway.
This means that you subtract the learning rate (alpha) times the average error weighted by the feature value, i.e. something like (X * theta (actual) - y (ideal)) * X[j], which is the derivative of (X*theta - y)^2 / 2 with respect to theta[j].
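One way to convince yourself that the averaged term really is the derivative of the cost is to compare it with a central finite difference. The snippet below is a minimal sketch in plain Python; the function names and the tiny data set are made up for illustration and are not part of the tutorial code.

```python
def cost(X, y, theta):
    """Half the mean squared error, the same quantity computeCost returns."""
    m = len(X)
    return sum((sum(t * x for t, x in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / (2 * m)


def analytic_gradient(X, y, theta):
    """Average of err * X[:, j] for every weight j, as in gradientDescent."""
    m = len(X)
    errs = [sum(t * x for t, x in zip(theta, row)) - yi
            for row, yi in zip(X, y)]
    return [sum(e * row[j] for e, row in zip(errs, X)) / m
            for j in range(len(theta))]


def numeric_gradient(X, y, theta, h=1e-6):
    """Central finite difference of the cost, one weight at a time."""
    grads = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += h
        down = list(theta)
        down[j] -= h
        grads.append((cost(X, y, up) - cost(X, y, down)) / (2 * h))
    return grads


# Made-up data: a bias column of ones plus a single feature.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 2.5, 3.5]
theta = [0.5, 0.5]

print("analytic:", analytic_gradient(X, y, theta))
print("numeric: ", numeric_gradient(X, y, theta))
```

The two printed lists should agree to several decimal places, which is what you would expect if the update line subtracts alpha times the true gradient.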

How to determine the learning rate and the variance in a gradient descent algorithm?

I started learning machine learning last week. When I tried to write a gradient descent script to estimate the model parameters, I came across a problem: how to choose an appropriate learning rate and variance. I found that different (learning rate, variance) pairs may lead to different results, and sometimes the algorithm doesn't converge at all. Also, if I change to another training data set, a well-chosen (learning rate, variance) pair probably will not work. For example (script below), when I set the learning rate to 0.001 and the variance to 0.00001, for 'data1' I get suitable values of theta0_guess and theta1_guess. But for 'data2' the algorithm doesn't converge, and even after trying dozens of (learning rate, variance) pairs I still can't reach convergence.
So could anybody tell me whether there are criteria or methods for determining the (learning rate, variance) pair?
import sys
data1 = [(0.000000,95.364693) ,
(1.000000,97.217205) ,
(2.000000,75.195834),
(3.000000,60.105519) ,
(4.000000,49.342380),
(5.000000,37.400286),
(6.000000,51.057128),
(7.000000,25.500619),
(8.000000,5.259608),
(9.000000,0.639151),
(10.000000,-9.409936),
(11.000000, -4.383926),
(12.000000,-22.858197),
(13.000000,-37.758333),
(14.000000,-45.606221)]
data2 = [(2104.,400.),
(1600.,330.),
(2400.,369.),
(1416.,232.),
(3000.,540.)]
def create_hypothesis(theta1, theta0):
    return lambda x: theta1*x + theta0
def linear_regression(data, learning_rate=0.001, variance=0.00001):
    theta0_guess = 1.
    theta1_guess = 1.
    theta0_last = 100.
    theta1_last = 100.
    m = len(data)
    # stop once neither parameter moves by more than `variance` in one step
    while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):
        theta1_last = theta1_guess
        theta0_last = theta0_guess
        hypothesis = create_hypothesis(theta1_guess, theta0_guess)
        theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
        theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])
    return ( theta0_guess,theta1_guess )
points = [(float(x),float(y)) for (x,y) in data1]
res = linear_regression(points)
print(res)

Plotting is the best way to see how your algorithm is performing. To see whether you have achieved convergence, you can plot the evolution of the cost function after each iteration. Once you see that it no longer improves much after a certain number of iterations, you can assume convergence. Take a look at the following code:
cost_f = []
while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):
    theta1_last = theta1_guess
    theta0_last = theta0_guess
    hypothesis = create_hypothesis(theta1_guess, theta0_guess)
    cost_f.append((1./(2*m))*sum([ pow(hypothesis(point[0]) - point[1], 2) for point in data]))
    theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
    theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])
import pylab
pylab.plot(range(len(cost_f)), cost_f)
pylab.show()
This plots the evolution of the cost function (run with learning_rate=0.01, variance=0.00001).
In that plot, after a thousand iterations you don't get much improvement. I normally declare convergence if the cost function decreases by less than 0.001 in one iteration, but this is just based on my own experience.
For choosing the learning rate, the best thing you can do is also to plot the cost function and see how it is performing, and always remember these two things:
if the learning rate is too small, you will get slow convergence
if the learning rate is too large, your cost function may not decrease in every iteration and therefore it will not converge
If you run your code with learning_rate > 0.029 and variance=0.001, you will be in the second case and gradient descent doesn't converge, while if you choose learning_rate < 0.0001 and variance=0.001, you will see that your algorithm takes a lot of iterations to converge.
Non-convergence example with learning_rate=0.03
Slow convergence example with learning_rate=0.0001
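The stopping rule and the two failure modes described above can be tried without a plotting library. The following is a rough sketch in plain Python, not the asker's original script: fit, cost and the max_iters safety cap are names and choices introduced here, the data is data1 from the question, and the threshold of 0.001 is the rule of thumb from this answer.

```python
# y values of data1 from the question; x is simply 0, 1, ..., 14
ys = [95.364693, 97.217205, 75.195834, 60.105519, 49.342380,
      37.400286, 51.057128, 25.500619, 5.259608, 0.639151,
      -9.409936, -4.383926, -22.858197, -37.758333, -45.606221]
data1 = [(float(x), y) for x, y in enumerate(ys)]


def cost(data, theta0, theta1):
    """Half the mean squared error of the line theta1*x + theta0."""
    m = len(data)
    return sum((theta1 * x + theta0 - y) ** 2 for x, y in data) / (2 * m)


def fit(data, learning_rate, tol=0.001, max_iters=5000):
    """Gradient descent that stops once the cost improves by less than tol.

    Returns (theta0, theta1, cost_history, converged). max_iters is a
    safety cap (an assumption of this sketch) so a diverging run still ends.
    """
    theta0, theta1 = 1.0, 1.0
    m = len(data)
    history = [cost(data, theta0, theta1)]
    for _ in range(max_iters):
        errors = [theta1 * x + theta0 - y for x, y in data]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, (x, y) in zip(errors, data)) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
        history.append(cost(data, theta0, theta1))
        improvement = history[-2] - history[-1]
        # A negative improvement means the cost went up, so keep going
        # (or run into max_iters) instead of declaring convergence.
        if 0 <= improvement < tol:
            return theta0, theta1, history, True
    return theta0, theta1, history, False


for lr, iters in [(0.01, 5000), (0.03, 200), (0.0001, 2000)]:
    t0, t1, hist, ok = fit(data1, lr, max_iters=iters)
    print(f"lr={lr}: converged={ok}, iterations={len(hist) - 1}, "
          f"first cost={hist[0]:.3g}, last cost={hist[-1]:.3g}")
```

Printing the first and last cost is a crude substitute for the plot: a last cost far above the first one signals a learning rate that is too large, while a run that hits the iteration cap with the cost still falling signals one that is too small.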

There are a bunch of ways to guarantee convergence of a gradient descent algorithm: line search, a fixed step size related to the Lipschitz constant of the gradient (that applies when you have a function; for tabulated data such as yours, you can estimate it from the differences between consecutive values), a decreasing step size for each iteration, and some others. Some of them can be found here.
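As an illustration of the first option, here is a minimal sketch of gradient descent with a backtracking line search on data2 from the question. It is one common variant (halve the step until a sufficient-decrease test passes), not necessarily the one the linked material describes; descend, backtracking, pick_step and the constants start, shrink and c are all names and values chosen for this example.

```python
data2 = [(2104., 400.), (1600., 330.), (2400., 369.),
         (1416., 232.), (3000., 540.)]


def cost(data, theta):
    """Half the mean squared error of the line theta[1]*x + theta[0]."""
    return sum((theta[1] * x + theta[0] - y) ** 2
               for x, y in data) / (2 * len(data))


def gradient(data, theta):
    """Gradient of cost with respect to (theta0, theta1)."""
    m = len(data)
    errors = [theta[1] * x + theta[0] - y for x, y in data]
    return (sum(errors) / m,
            sum(e * x for e, (x, y) in zip(errors, data)) / m)


def descend(data, pick_step, iters):
    """Run gradient descent; pick_step(theta, grad) returns the step size."""
    theta = (1.0, 1.0)
    history = [cost(data, theta)]
    for _ in range(iters):
        grad = gradient(data, theta)
        step = pick_step(theta, grad)
        theta = (theta[0] - step * grad[0], theta[1] - step * grad[1])
        history.append(cost(data, theta))
    return theta, history


def backtracking(data, start=1.0, shrink=0.5, c=1e-4):
    """Build a step picker that shrinks the step until the cost drops enough."""
    def pick_step(theta, grad):
        f0 = cost(data, theta)
        g2 = grad[0] ** 2 + grad[1] ** 2
        step = start
        while True:
            trial = (theta[0] - step * grad[0], theta[1] - step * grad[1])
            # Sufficient-decrease test: accept only if the cost did not go up.
            if cost(data, trial) <= f0 - c * step * g2:
                return step
            step *= shrink
    return pick_step


# Fixed learning rate from the question versus a searched step size.
_, fixed = descend(data2, lambda theta, grad: 0.001, iters=10)
theta, searched = descend(data2, backtracking(data2), iters=100)
print("fixed step 0.001:", fixed[0], "->", fixed[-1])
print("backtracking:    ", searched[0], "->", searched[-1], "theta =", theta)
```

Because a step is only accepted when the cost does not increase, the searched run cannot blow up the way the fixed step does on data2. It can still be slow, since the x values are in the thousands and the two parameters are on very different scales.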
