I'm studying gradient descent: what does precision mean in this code? - python

I'm studying gradient descent on my own.
I plan to mention it in my résumé for university admission.
Does precision mean the allowable error?
x_old = 0
x_new = 6 # The algorithm starts at x=6
eps = 0.01 # step size
precision = 0.00001
def f_prime(x):
    return 4 * x**3 - 9 * x**2
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - eps * f_prime(x_old)
print("Local minimum occurs at: " + str(x_new))

It seems that precision is a means to check for convergence: if the last iteration of gradient descent caused only a small change, then stop.
This approach is not very robust. First, a small change in a single iteration is not a strong indication for convergence. It would be better to look for a small change in several consecutive iterations. Second, the process might not converge at all, so some sort of guard against an infinite loop should be used.
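A minimal sketch of both suggestions, reusing f_prime from the question. The names patience and max_iter and their default values are assumptions made for illustration; they are not part of the original code.

```python
def f_prime(x):
    # Derivative of f(x) = x**4 - 3*x**3, as in the question.
    return 4 * x**3 - 9 * x**2


def gradient_descent(x_start, eps=0.01, precision=0.00001,
                     patience=3, max_iter=10000):
    """Return (x, iterations, converged)."""
    x = x_start
    small_steps = 0  # consecutive iterations whose change was <= precision
    for iteration in range(1, max_iter + 1):
        x_next = x - eps * f_prime(x)
        if abs(x_next - x) <= precision:
            small_steps += 1
        else:
            small_steps = 0
        x = x_next
        if small_steps >= patience:
            return x, iteration, True
    return x, max_iter, False  # guard: gave up without converging


x_min, n_iter, converged = gradient_descent(6)
print("Local minimum occurs at:", x_min)
print("iterations:", n_iter, "converged:", converged)
```

The loop stops only after several consecutive small changes, and max_iter bounds the run time if the iteration never settles.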

Related

The initial Gradient in Gradient Descent is abysmal and wrong

I am building in Python using IDLE 3.9 an input optimiser, which optimises a certain input "thetas" for a certain target output "targetRes" versus a calculated output obtained from running a model with input "thetas" in a solver. The way it works is that a model is first defined with a function called FEA(thetas, fem). After running the solver, FEA returns the output.
The chosen optimisation algorithm is Gradient Descent. FEA (output) is taken as the hypothesis function, and the target output is subtracted from it. The result is then squared to give the loss function. The gradient of the loss function is then determined using FDM. The update step then takes place. For now, I am only running the algorithm on thetas[0]. Below is the GD code:
targetRes = -0.1
thetas = [1000., 1., 1., 1., 1.] # input here initial value of unknown vector theta
def LF(thetas):
    return (FEA(thetas, fem) - targetRes) ** 2 / 2
def FDM(thetas, LF):
    fdm = []
    for i, theta in enumerate(thetas):
        h = 0.1
        if i == 0:
            print(h)
        thetas_p_h = []
        for t in thetas:
            thetas_p_h.append(t)
        thetas_p_h[i] += h
        thetas_m_h = []
        for t in thetas:
            thetas_m_h.append(t)
        thetas_m_h[i] -= h
        grad = (LF(thetas_p_h) - LF(thetas_m_h)) / (2 * h)
        fdm.append(grad)
    return fdm
def GD(thetas, LF):
    tol = 0.000001
    alpha = 10000000
    Nmax = 1000
    for n in range(Nmax):
        gradient = FDM(thetas, LF)
        thetas_new = []
        for gradient_item, theta_item in zip(gradient, thetas):
            t_new = theta_item - (alpha * gradient_item)
            thetas_new.append(t_new)
        print(thetas, f'gradient = {gradient}', LF(thetas), thetas_new, LF(thetas_new))
        if tol >= abs(LF(thetas_new) - LF(thetas)):
            print(f"solution converged in {n} iterations, theta = {thetas_new}, LF(theta) = {LF(thetas_new)}, FEA(theta) = {FEA(thetas_new, fem)}")
            return thetas_new
        thetas = thetas_new
    else:
        print(f"reached max iterations, Nmax = {Nmax}")
        return None
GD(thetas, LF)
As you can see the gradient descent algorithm I am using is different to the linear regression type, and that is because there are no features to evaluate, just labels (y). Unfortunately I am not allowed to provide the solver code and most likely not allowed to provide the model code as well, defined in FEA.
In the current example, the calculated initial output is -0.070309. My issues are:
The gradient is very minute at the first update iteration, and it is 2.078369999999961e-06, and the final update iteration gradient value is 1.834250000000102e-08. In fact the first update iteration gradient value is most likely wrong, as I attempted to calculate it by hand and I got a value of somewhere around 0.000005, which is still tiny.
I am using a gigantic learning rate value, because the algorithm is horribly slow without such.
The algorithm converges to target output only with certain input values. There was another case where I had other values of thetas and the algorithm at zeroth iteration produces a loss function of order of 10^7, and at the first iteration it drops right down to zero and converges, which is unrealistic. In both cases I use a large learning rate, and both cases converge to the target output. In some other cases I have other inputs and the algorithm does not converge to the target output. Other cases lead to a negative value of thetas[0] which causes the solver to fail.
Also dismiss the fact that I am not using any external libraries; it is intended. So of course without trying to run the code, any observations? Does anyone see anything obvious that I'm missing here? As I think this could be the issue. Could it be due to the orders of magnitude of the inputs? What are your thoughts? (there are no issues with the solver or model function named FEA, both have been repeatedly verified to work perfectly fine).
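One way to check the hand calculation is to run the same central-difference scheme on a loss whose gradient is known exactly. The sketch below is only an assumption-laden illustration: toy_loss is a hypothetical stand-in for LF (the real FEA model is not available), and central_difference mirrors the structure of FDM above with the same h = 0.1.

```python
def central_difference(loss, thetas, h=0.1):
    """Central-difference gradient of loss at thetas, one component at a time."""
    grads = []
    for i in range(len(thetas)):
        plus = list(thetas)
        minus = list(thetas)
        plus[i] += h
        minus[i] -= h
        grads.append((loss(plus) - loss(minus)) / (2 * h))
    return grads


def toy_loss(thetas):
    # Hypothetical stand-in for LF; its exact gradient is thetas itself.
    return sum(t * t for t in thetas) / 2


thetas = [1000., 1., 1., 1., 1.]
numeric = central_difference(toy_loss, thetas)
print(numeric)  # should be close to thetas
```

If the helper reproduces a known gradient, any remaining discrepancy is more likely to come from the model's sensitivity to thetas[0] than from the differencing code.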

How do I find the percentage error in a Monte Carlo algorithm?

I have written a Monte Carlo program to integrate a function f(x).
I have now been asked to calculate the percentage error.
Having done a quick literature search, I found that this can be given with the equation %error = (sqrt(var[f(x)]/n))*100, where n is the number of random points I used to derive my answer.
However, when I run my integration code, my percentage error is greater than that given by this formula.
Do I have the correct formula?
Any help would be greatly appreciated. Thanks x
Here is a quick example: estimating the integral of a linear function on the interval [0...1] using Monte Carlo. To estimate the error you have to collect the second moment (values squared), then compute the variance, the standard deviation, and (assuming the CLT) the error of the simulation in the original units as well as in %.
Code, Python 3.7, Anaconda, Win10 x64
import numpy as np
def f(x): # linear function to integrate
    return x
np.random.seed(312345)
N = 100000
x = np.random.random(N)
q = f(x) # first moment
q2 = q*q # second moment
mean = np.sum(q) / float(N) # compute mean explicitly, not using np.mean
var = np.sum(q2) / float(N) - mean * mean # variance as E[X^2] - E[X]^2
sd = np.sqrt(var) # std.deviation
print(mean) # should be 1/2
print(var) # should be 1/12
print(sd) # should be 0.5/sqrt(3)
print("-----------------------------------------------------")
sigma = sd / np.sqrt(float(N)) # assuming CLT, error estimation in original units
print("result = {0} with error +- {1}".format(mean, sigma))
err_pct = sigma / mean * 100.0 # error estimate in percents
print("result = {0} with error +- {1}%".format(mean, err_pct))
Be aware that we computed a one-sigma error, so (leaving aside that the error estimate is itself a random value) the true result lies within the printed mean +- error in only about 68% of runs. You could print mean +- 2*error, which contains the true result in about 95% of runs; mean +- 3*error contains it in about 99.7% of runs, and so on.
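The coverage figures can be checked empirically by repeating the simulation many times and counting how often the true value 1/2 falls inside mean +- k*error. This is a rough sketch using only the standard library; the sample size of 400 and the 2000 repetitions are arbitrary choices, not values from the code above.

```python
import math
import random


def mc_estimate(n, rng):
    """Monte Carlo estimate of the integral of f(x) = x on [0, 1] and its one-sigma error."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        q = rng.random()  # f(x) = x
        total += q
        total_sq += q * q
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, math.sqrt(var / n)


rng = random.Random(312345)
runs = 2000
inside = {1: 0, 2: 0, 3: 0}
for _ in range(runs):
    mean, sigma = mc_estimate(400, rng)
    for k in inside:
        if abs(mean - 0.5) <= k * sigma:
            inside[k] += 1

coverage = {k: count / runs for k, count in inside.items()}
print(coverage)  # fractions should be roughly 0.68, 0.95 and 0.997
```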
UPDATE
For the sample variance estimate, there is a known problem called bias of the estimator. Basically, we slightly underestimate the sampling variance, so a proper correction (Bessel's correction) should be applied:
var = np.sum(q2) / float(N) - mean * mean # variance as E[X^2] - E[X]^2
var *= float(N)/float(N-1)
In many cases (and many examples) it is omitted because N is very large, which makes the correction pretty much invisible. For example, if you have a statistical error of 1% but N is in the millions, the correction is of no practical use.
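As a small cross-check, Python's standard statistics module provides both estimators, so the N/(N-1) factor can be verified directly. The sample below is arbitrary and only serves as an illustration.

```python
import random
import statistics

rng = random.Random(0)
q = [rng.random() for _ in range(1000)]
n = len(q)

biased = statistics.pvariance(q)   # divides by N
unbiased = statistics.variance(q)  # divides by N - 1 (Bessel's correction)
corrected = biased * n / (n - 1)
print(corrected, unbiased)  # the two values agree up to rounding
```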

Why isn't my code using 4th-order Runge-Kutta giving me the expected values?

I'm having a little trouble understanding what's wrong with my code; any help would be much appreciated.
I wanted to solve this simple equation: y' = 3y - 4e^(-t), y(0) = 1 (as coded in f below).
However, the values my code gives don't match the ones in my book or from Wolfram: y goes up as x grows.
import matplotlib.pyplot as plt
from numpy import exp
from scipy.integrate import ode
# initial values
y0, t0 = [1.0], 0.0
def f(t, y):
    f = [3.0*y[0] - 4.0/exp(t)]
    return f
# initialize the 4th order Runge-Kutta solver
r = ode(f).set_integrator('dopri5')
r.set_initial_value(y0, t0)
t1 = 10
dt = 0.1
x, y = [], []
while r.successful() and r.t < t1:
    x.append(r.t+dt); y.append(r.integrate(r.t+dt))
    print(r.t+dt, r.integrate(r.t+dt))
Your equation in general has the solution
y(x) = (y0-1)*exp(3*x) + exp(-x)
Due to the choice of initial conditions, the exact solution does not contain the growing component of the first term. However, small perturbations due to discretization and floating point errors will generate a non-zero coefficient in the growing term. At the end of the integration interval this random coefficient is multiplied by exp(3*10)=1.0686e+13, which magnifies small discretization errors of size 1e-7 to contributions in the result of size 1e+6, as observed when running the original code.
You can force the integrator to be more precise in its internal steps without reducing the output step size dt by setting error thresholds like in
r = ode(f).set_integrator('dopri5', atol=1e-16, rtol=1e-20)
However, you cannot avoid the deterioration of the result completely, as the floating point errors of size 1e-16 get magnified to global error contributions of size 1e-3.
Also, you should notice that each call of r.integrate(r.t+dt) advances the integrator by dt, so each pass through the loop advances it twice, and the stored and printed values alternate rather than coincide. If you just want to print the current state of the integrator, use
print(r.t,r.y,yexact(r.t,y0))
where the last is to compare to the exact solution which is, as already said,
def yexact(x,y0):
    return [ (y0[0]-1)*exp(3*x)+exp(-x) ]
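The error magnification can be reproduced without SciPy. The sketch below uses a hand-written classical fixed-step RK4 (not the adaptive dopri5 integrator from the question, so the numbers will differ) on the same equation with y(0) = 1, and compares against the exact solution exp(-t).

```python
from math import exp


def f(t, y):
    return 3.0 * y - 4.0 * exp(-t)


def rk4(f, y0, t0, t1, dt):
    """Classical 4th-order Runge-Kutta with a fixed step; returns lists of t and y."""
    n_steps = round((t1 - t0) / dt)
    ts, ys = [t0], [y0]
    t, y = t0, y0
    for i in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt / 2 * k1)
        k3 = f(t + dt / 2, y + dt / 2 * k2)
        k4 = f(t + dt, y + dt * k3)
        y += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t0 + (i + 1) * dt
        ts.append(t)
        ys.append(y)
    return ts, ys


ts, ys = rk4(f, 1.0, 0.0, 10.0, 0.1)
errors = [abs(y - exp(-t)) for t, y in zip(ts, ys)]
for i in (10, 50, 100):  # t = 1, 5, 10
    print(ts[i], ys[i], exp(-ts[i]), errors[i])
```

The error is small near t = 1 and then grows roughly like exp(3*t), which is the behaviour described above.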

Batch Gradient Descent for Logistic Regression

I've been following Andrew Ng's CS229 machine learning course, and am now covering logistic regression. The goal is to maximize the log likelihood function and find the optimal values of theta to do so. The link to the lecture notes is http://cs229.stanford.edu/notes/cs229-notes1.ps (pages 16-19). The code below was shown on the course homepage (in MATLAB, though; I converted it to Python).
I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for an introductory machine learning course). The data has two features, which are the scores on two exams. The output is 1 if the student received admission and 0 if the student did not. I have shown all of the code below. The following code causes the likelihood function to converge to a maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. When I use these values to test a training example like [1, 30.28671077, 43.89499752], which should give 0 as output, I obtain 0.576, which makes no sense to me. If I test the hypothesis function with input [1, 10, 10] I obtain 0.515, which once again makes no sense. These values should correspond to a lower probability. This has me quite confused.
import numpy as np
import sig as s
def batchlogreg(X, y):
    max_iterations = 800
    alpha = 0.00001
    (m,n) = np.shape(X)
    X = np.insert(X, 0, 1, 1)
    theta = np.array([0] * (n+1), 'float')
    ll = np.array([0] * max_iterations, 'float')
    for i in range(max_iterations):
        hx = s.sigmoid(np.dot(X, theta))
        d = y - hx
        theta = theta + alpha*np.dot(np.transpose(X),d)
        ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
    return (theta, ll)
Note that the sigmoid function has:
sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
Since you get all probabilities above 0.5, this suggests that you never make X * theta negative, or that you do, but your learning rate is too small to make it matter.
for i in range(max_iterations):
    hx = s.sigmoid(np.dot(X, theta)) # this will probably be > 0.5 initially
    d = y - hx # then this will be "very" negative when y is 0
    theta = theta + alpha*np.dot(np.transpose(X),d) # (1)
    ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))
The problem is most likely at (1). The dot product will be very negative, but your alpha is very small and will negate its effect. So theta will never decrease enough to properly handle correctly classifying labels that are 0.
Positive instances are then only barely correctly classified for the same reason: your algorithm does not discover a reasonable hypothesis under your number of iterations and learning rate.
Possible solution: increase alpha and / or the number of iterations, or use momentum.
It sounds like you could be confusing probabilities with assignments.
The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.
So, for the example you gave, the decisions would both be 1 (which means the model is wrong for the first example where it should be 0).
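The decision rule can be sketched as follows, using the theta values and the two inputs reported in the question. The helper names sigmoid and predict and the threshold parameter are illustrative assumptions, not part of the original code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x, threshold=0.5):
    """Return (probability, label) for one example x that already includes the leading 1."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return p, int(p >= threshold)


theta = [-0.05560301, 0.01081111, 0.00088362]  # values reported in the question
for x in ([1, 30.28671077, 43.89499752], [1, 10, 10]):
    print(predict(theta, x))  # both probabilities exceed 0.5, so both labels are 1
```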
I ran into the same problem and found the reason.
Normalize X first, or set a scale-comparable intercept like 50.
Otherwise the contours of the cost function are too "narrow": a big alpha makes it overshoot and a small alpha fails to make progress.
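A rough sketch of the normalization idea in plain Python. The exam scores below are made up for illustration (they are not the course data set), and alpha = 0.1 with 500 iterations is an arbitrary choice that works once the features are standardized; the update also divides by m, which differs from the code in the question.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_logreg(rows, labels, alpha=0.1, max_iterations=500):
    """Batch gradient ascent on the mean log likelihood; rows exclude the intercept."""
    X = [[1.0] + list(row) for row in rows]
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(max_iterations):
        hx = [sigmoid(sum(t * v for t, v in zip(theta, x))) for x in X]
        for j in range(len(theta)):
            grad_j = sum((y - h) * x[j] for y, h, x in zip(labels, hx, X)) / m
            theta[j] += alpha * grad_j
    return theta


# Made-up exam scores: low scores rejected (0), high scores admitted (1).
scores = [[30, 40], [35, 50], [45, 35], [50, 45],
          [70, 80], [80, 70], [90, 85], [65, 90]]
admitted = [0, 0, 0, 0, 1, 1, 1, 1]

scaled, means, stds = standardize(scores)
theta = batch_logreg(scaled, admitted)
probs = [sigmoid(theta[0] + theta[1] * a + theta[2] * b) for a, b in scaled]
print(theta)
print([round(p, 3) for p in probs])
```

New inputs have to be rescaled with the same means and stds before they are passed to the fitted model.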

How to determine the learning rate and the variance in a gradient descent algorithm?

I started learning machine learning last week. While writing a gradient descent script to estimate model parameters, I ran into a problem: how do I choose an appropriate learning rate and variance? I found that different (learning rate, variance) pairs may lead to different results, and sometimes the algorithm does not converge at all. Also, after switching to another training data set, a well-chosen (learning rate, variance) pair will probably no longer work. For example (script below), when I set the learning rate to 0.001 and variance to 0.00001, for 'data1' I get suitable values of theta0_guess and theta1_guess. But for 'data2' the algorithm does not converge, even after I tried dozens of (learning rate, variance) pairs.
So could anybody tell me whether there are criteria or methods for determining the (learning rate, variance) pair?
import sys
data1 = [(0.000000,95.364693) ,
(1.000000,97.217205) ,
(2.000000,75.195834),
(3.000000,60.105519) ,
(4.000000,49.342380),
(5.000000,37.400286),
(6.000000,51.057128),
(7.000000,25.500619),
(8.000000,5.259608),
(9.000000,0.639151),
(10.000000,-9.409936),
(11.000000, -4.383926),
(12.000000,-22.858197),
(13.000000,-37.758333),
(14.000000,-45.606221)]
data2 = [(2104.,400.),
(1600.,330.),
(2400.,369.),
(1416.,232.),
(3000.,540.)]
def create_hypothesis(theta1, theta0):
    return lambda x: theta1*x + theta0
def linear_regression(data, learning_rate=0.001, variance=0.00001):
    theta0_guess = 1.
    theta1_guess = 1.
    theta0_last = 100.
    theta1_last = 100.
    m = len(data)
    while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):
        theta1_last = theta1_guess
        theta0_last = theta0_guess
        hypothesis = create_hypothesis(theta1_guess, theta0_guess)
        theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
        theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])
    return ( theta0_guess,theta1_guess )
points = [(float(x),float(y)) for (x,y) in data1]
res = linear_regression(points)
print(res)
Plotting is the best way to see how your algorithm is performing. To see whether you have achieved convergence, you can plot the evolution of the cost function after each iteration; once it stops improving much after a certain number of iterations, you can assume convergence. Take a look at the following code:
cost_f = []
while (abs(theta1_guess-theta1_last) > variance or abs(theta0_guess - theta0_last) > variance):
    theta1_last = theta1_guess
    theta0_last = theta0_guess
    hypothesis = create_hypothesis(theta1_guess, theta0_guess)
    cost_f.append((1./(2*m))*sum([ pow(hypothesis(point[0]) - point[1], 2) for point in data]))
    theta0_guess = theta0_guess - learning_rate * (1./m) * sum([hypothesis(point[0]) - point[1] for point in data])
    theta1_guess = theta1_guess - learning_rate * (1./m) * sum([ (hypothesis(point[0]) - point[1]) * point[0] for point in data])
import pylab
pylab.plot(range(len(cost_f)), cost_f)
pylab.show()
Which will plot the following graphic (execution with learning_rate=0.01, variance=0.00001)
As you can see, after a thousand iterations you don't get much improvement. I normally declare convergence if the cost function decreases by less than 0.001 in one iteration, but this is just based on my own experience.
For choosing the learning rate, the best thing you can do is also to plot the cost function and see how it is performing, and always remember these two things:
if the learning rate is too small you will get slow convergence
if the learning rate is too large your cost function may not decrease in every iteration and therefore it will not converge
If you run your code with learning_rate > 0.029 and variance=0.001, you will be in the second case: gradient descent doesn't converge. If you choose learning_rate < 0.0001 and variance=0.001, you will see that your algorithm takes a lot of iterations to converge.
Non-convergence example with learning_rate=0.03
Slow convergence example with learning_rate=0.0001
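Since the plots are not reproduced here, the three regimes can also be compared numerically. This is a sketch under assumptions: cost_history is a hypothetical helper that runs a fixed 200 iterations on data1 instead of using the variance stopping rule, starting from the same initial guesses of 1.

```python
data1 = [(0, 95.364693), (1, 97.217205), (2, 75.195834), (3, 60.105519),
         (4, 49.342380), (5, 37.400286), (6, 51.057128), (7, 25.500619),
         (8, 5.259608), (9, 0.639151), (10, -9.409936), (11, -4.383926),
         (12, -22.858197), (13, -37.758333), (14, -45.606221)]


def cost_history(data, learning_rate, iterations=200):
    """Run a fixed number of gradient descent steps, recording the cost before each one."""
    theta0, theta1 = 1.0, 1.0
    m = len(data)
    history = []
    for _ in range(iterations):
        residuals = [theta1 * x + theta0 - y for x, y in data]
        history.append(sum(r * r for r in residuals) / (2 * m))
        grad0 = sum(residuals) / m
        grad1 = sum(r * x for r, (x, y) in zip(residuals, data)) / m
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return history


slow = cost_history(data1, 0.0001)
good = cost_history(data1, 0.01)
bad = cost_history(data1, 0.03)
for name, history in (("0.0001", slow), ("0.01", good), ("0.03", bad)):
    print(name, history[0], history[-1])
```

With 0.01 the cost falls steadily, with 0.0001 it falls much more slowly, and with 0.03 it ends up larger than where it started.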
There are a bunch of ways to guarantee convergence of a gradient descent algorithm: a line search, a fixed step size related to the Lipschitz constant of the gradient (that applies when you have a function; for a table such as yours, you can use the differences between consecutive values), a decreasing step size for each iteration, and some others. Some of them can be found here.
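As one illustration of the line search option, here is a minimal sketch of gradient descent with a backtracking (Armijo) line search on data2 from the question. The constants c and shrink, the cap of 60 halvings, and the function names are assumptions made for this example. It guarantees that the cost never increases, but on unscaled data like data2 progress toward the optimum can still be slow.

```python
data2 = [(2104., 400.), (1600., 330.), (2400., 369.), (1416., 232.), (3000., 540.)]


def cost(theta, data):
    theta0, theta1 = theta
    return sum((theta1 * x + theta0 - y) ** 2 for x, y in data) / (2 * len(data))


def gradient(theta, data):
    theta0, theta1 = theta
    m = len(data)
    residuals = [theta1 * x + theta0 - y for x, y in data]
    return (sum(residuals) / m,
            sum(r * x for r, (x, y) in zip(residuals, data)) / m)


def backtracking_descent(data, iterations=50, c=1e-4, shrink=0.5):
    """Gradient descent where each step size is chosen by backtracking."""
    theta = (1.0, 1.0)
    history = [cost(theta, data)]
    for _ in range(iterations):
        g = gradient(theta, data)
        g_norm_sq = g[0] ** 2 + g[1] ** 2
        step = 1.0
        for _halving in range(60):
            candidate = (theta[0] - step * g[0], theta[1] - step * g[1])
            # Armijo condition: require a sufficient decrease in the cost.
            if cost(candidate, data) <= history[-1] - c * step * g_norm_sq:
                break
            step *= shrink
        else:
            break  # no acceptable step found; stop instead of taking a bad step
        theta = candidate
        history.append(cost(candidate, data))
    return theta, history


theta, history = backtracking_descent(data2)
print(theta, history[0], history[-1])
```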
