I'm trying to implement a very naive gradient descent in python. However, it looks like it goes into an infinite loop. Could you please help me debug it?
y = lambda x : x**2
dy_dx = lambda x : 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while derivative(optimum) != 0:
        optimum = optimum - derivative(optimum)
    else:
        return optimum

gradient_descent(y, dy_dx, 5)
Edit:
Now I have this code, but I really can't comprehend the output. P.S.: it might freeze your CPU.
y = lambda x : x**2
dy_dx = lambda x : 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while abs(derivative(optimum)) > 0.01:
        optimum = optimum - 2*derivative(optimum)
        print((optimum, derivative(optimum)))
    else:
        return optimum

gradient_descent(y, dy_dx, 5)
Now I'm trying to apply it to a regression problem; however, the output doesn't appear to be correct, as shown below:
[image: output of the gradient descent code below]
import matplotlib.pyplot as plt
def stepGradient(x, y, step):
    b_current = 0
    m_current = 0
    b_gradient = 0
    m_gradient = 0
    N = int(len(x))
    for i in range(0, N):
        b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
        m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    while abs(b_gradient) > 0.01 and abs(m_gradient) > 0.01:
        b_current = b_current - (step * b_gradient)
        m_current = m_current - (step * m_gradient)
        for i in range(0, N):
            b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
            m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    return [b_current, m_current]

x = [1, 2, 2, 3, 4, 5, 7, 8]
y = [1.5, 3, 1, 3, 2, 5, 6, 7]
step = 0.00001
(b, m) = stepGradient(x, y, step)
plt.scatter(x, y)
abline_values = [m * i + b for i in x]
plt.plot(x, abline_values, 'b')
plt.show()
Fixed :D
import matplotlib.pyplot as plt

def stepGradient(x, y):
    step = 0.001
    b_current = 0
    m_current = 0
    b_gradient = 0
    m_gradient = 0
    N = int(len(x))
    for i in range(0, N):
        b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
        m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    while abs(b_gradient) > 0.01 or abs(m_gradient) > 0.01:
        b_current = b_current - (step * b_gradient)
        m_current = m_current - (step * m_gradient)
        b_gradient = 0
        m_gradient = 0
        for i in range(0, N):
            b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
            m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    return [b_current, m_current]

x = [1, 2, 2, 3, 4, 5, 7, 8, 10]
y = [1.5, 3, 1, 3, 2, 5, 6, 7, 20]
(b, m) = stepGradient(x, y)
plt.scatter(x, y)
abline_values = [m * i + b for i in x]
plt.plot(x, abline_values, 'b')
plt.show()
Your while loop stops only when a calculated floating-point value equals zero. This is naïve, since floating-point values are rarely calculated exactly. Instead, stop the loop when the calculated value is close enough to zero. Use something like
while abs(derivative(optimum)) > eps:
where eps is the desired precision of the calculated value. This could be made another parameter, perhaps with a default value of 1e-10 or some such.
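For instance, a minimal sketch (my own, not your original code) with the tolerance as a parameter, here called tol with a default of 1e-10; it also already includes a small step factor, for the reason explained next:

def gradient_descent(function, derivative, initial_guess, step=0.01, tol=1e-10):
    # stop when the derivative is close enough to zero, not exactly zero
    optimum = initial_guess
    while abs(derivative(optimum)) > tol:
        optimum = optimum - step * derivative(optimum)
    return optimum

print(gradient_descent(lambda x: x**2, lambda x: 2*x, 5))  # approximately 0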
That said, the problem in your case is worse. Your algorithm is far too naïve in assuming that the calculation
optimum = optimum - derivative(optimum)
will move the value of optimum closer to the actual optimum value. In your particular case, the variable optimum just cycles back and forth between 5 (your initial guess) and -5. Note that the derivative at 5 is 10 and the derivative at -5 is -10.
So you need to avoid such cycling. You could multiply your delta derivative(optimum) by something smaller than 1, which would work in your particular case y = x**2, but it will not work in general.
To be completely safe, 'bracket' your optimum point between a smaller value and a larger value, and use the derivative to find the next guess. But ensure that the next guess does not go outside the bracketed interval. If it does, or if the convergence of your guesses is too slow, use another method such as bisection or golden-section search.
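For illustration only (my sketch, not part of the original answer), here is a bisection on the derivative, assuming you already have a bracket [lo, hi] with derivative(lo) < 0 < derivative(hi):

def bisect_minimum(derivative, lo, hi, tol=1e-10):
    # assumes derivative(lo) < 0 < derivative(hi), so a stationary point lies in [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if derivative(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

print(bisect_minimum(lambda x: 2*x, -5, 5))  # ~0.0 for y = x**2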
Of course, this means your 'very naïve gradient descent' algorithm is too naïve to work in general. That's why real optimization routines are more complicated.
You also need to decrease your step size (gamma in the gradient descent formula):
y = lambda x : x**2
dy_dx = lambda x : 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while abs(derivative(optimum)) > 0.01:
        optimum = optimum - 0.01*derivative(optimum)
        print((optimum, derivative(optimum)))
    else:
        return optimum
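With that smaller step the call from the question now terminates; a quick check (it prints one line per iteration because of the print inside the loop):

print(gradient_descent(y, dy_dx, 5))  # returns a value just below 0.005, close to the true minimum at x = 0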
I'm trying to calculate sin(x) using Taylor series without using factorials.
import math, time
import matplotlib.pyplot as plot

def sin3(x, i=30):
    x %= 2 * math.pi
    n = 0
    dn = x**2 / 2
    for c in range(4, 2 * i + 4, 2):
        n += dn
        dn *= -x**2 / ((c + 1) * (c + 2))
    return x - n

def draw_graph(start = -800, end = 800):
    y = [sin3(i/100) for i in range(start, end)]
    x = [i/100 for i in range(start, end)]
    y2 = [math.sin(i/100) for i in range(start, end)]
    x2 = [i/100 for i in range(start, end)]
    plot.fill_between(x, y, facecolor="none", edgecolor="red", lw=0.7)
    plot.fill_between(x2, y2, facecolor="none", edgecolor="blue", lw=0.7)
    plot.show()
When you run the draw_graph function, it uses matplotlib to draw a graph: the red line is the output of my sin3 function, and the blue line is the correct output of the math.sin method.
As you can see, the curve is not quite right: it isn't high or low enough (it seems to peak at about 0.5), and it also behaves strangely, generating a small peak around 0.25 and then dropping down again. How can I adjust my function to match the correct output of math.sin?
You have the wrong equation for sin(x), and you also have a messed up loop invariant.
The formula for sin(x) is x/1! - x^3/3! + x^5/5! - x^7/7!..., so I really don't know why you're initializing dn to something involving x^2.
You also want to ask yourself: what is my loop invariant? What is the value of dn when I reach the start of my loop? It is clear from the way you update dn that you expect it to be something involving x^c / c!. Yet on the very first iteration of the loop, c = 4, while dn involves only x^2.
Here is what you meant to write:
def sin3(x, i=30):
    x %= 2 * math.pi
    n = 0
    dn = x
    for c in range(1, 2 * i + 4, 2):
        n += dn
        dn *= -x**2 / ((c + 1) * (c + 2))
    return n
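A quick sanity check of the corrected version against math.sin (my own addition, reusing the math import from the question):

for v in (0.0, 0.5, 1.0, math.pi / 2, 3.0):
    print(v, sin3(v), math.sin(v))  # the last two columns should agree to many decimal places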
I have tried to use a toy linear regression problem to implement optimisation of the MSE function with the gradient descent algorithm.
import numpy as np

# Data points
x = np.array([1, 2, 3, 4])
y = np.array([1, 1, 2, 2])

# MSE function
f = lambda a, b: 1 / len(x) * np.sum(np.power(y - (a * x + b), 2))

# Gradient
def grad_f(v_coefficients):
    a = v_coefficients[0, 0]
    b = v_coefficients[1, 0]
    return np.array([1 / len(x) * np.sum(2 * (y - (a * x + b)) * x),
                     1 / len(x) * np.sum(2 * (y - (a * x + b)))]).reshape(2, 1)

# Gradient Decent with epsilon as tol vector and alpha as the step/learning rate
def gradient_decent(v_prev):
    tol = 10 ** -3
    epsilon = np.array([tol * np.ones([2, 1], int)])
    alpha = 0.2
    v_next = v_prev - alpha * grad_f(v_prev)
    if (np.abs(v_next - v_prev) <= epsilon).all():
        return v_next
    else:
        gradient_decent(v_next)

# v_0 is the initial guess
v_0 = np.array([[1], [1]])
gradient_decent(v_0)
I have tried different alpha values, but the code never converges (infinite recursion). It seems that the issue is with the stop condition of the recursion, but after a few runs v_next and v_prev bounce between -infinity and +infinity.
It's great that you are learning machine learning (^_^) by implementing some base algorithms yourself. Regarding your question, there are two problems in your code. The first one is mathematical: the sign in
def grad_f(v_coefficients):
    a = v_coefficients[0, 0]
    b = v_coefficients[1, 0]
    return np.array([1 / len(x) * np.sum(2 * (y - (a * x + b)) * x),
                     1 / len(x) * np.sum(2 * (y - (a * x + b)))]).reshape(2, 1)
should be

return -np.array(...)

since the cost being minimized is the MSE, f(a, b) = 1/len(x) * np.sum((y - (a * x + b))**2), and differentiating the squared term with respect to a or b brings out a factor of -1 from the inner (a * x + b), so both partial derivatives carry a leading minus sign.
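If you want to convince yourself of the sign, you can compare the analytic gradient with a finite-difference estimate; this is a quick check I added, reusing the x, y and f defined in the question, at the hypothetical test point a = b = 1:

h = 1e-6
a0, b0 = 1.0, 1.0
df_da = (f(a0 + h, b0) - f(a0 - h, b0)) / (2 * h)  # numerical df/da
df_db = (f(a0, b0 + h) - f(a0, b0 - h)) / (2 * h)  # numerical df/db
print(df_da, df_db)  # both positive (about 11.5 and 4.0), while the un-negated formula gives -11.5 and -4.0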
The second one is programming: this kind of code will not return a result in Python:
def add(x):
    new_x = x + 1
    if new_x > 10:
        return new_x
    else:
        add(new_x)
you must use return in both clauses of the if statement, so it should be
def add(x):
    new_x = x + 1
    if new_x > 10:
        return new_x
    else:
        return add(new_x)
There is also a minor issue with the alpha coefficient: for these particular data points, alpha=0.2 is too big for the algorithm to converge, so you need a smaller alpha. I also slightly refactored your initial code using NumPy broadcasting conventions (https://numpy.org/doc/stable/user/basics.broadcasting.html) to get the following:
import numpy as np

# Data points
x = np.array([1, 2, 3, 4])
y = np.array([1, 1, 2, 2])

# MSE function
f = lambda a, b: np.mean(np.power(y - (a * x + b), 2))

# Gradient
def grad_f(v_coefficients):
    a = v_coefficients[0, 0]
    b = v_coefficients[1, 0]
    return -np.array([np.mean(2 * (y - (a * x + b)) * x),
                      np.mean(2 * (y - (a * x + b)))]).reshape(2, 1)

# Gradient descent with tol as the tolerance and alpha as the step/learning rate
def gradient_decent(v_prev):
    tol = 1e-3
    # epsilon = np.array([tol * np.ones([2, 1], int)])  # not needed, due to numpy broadcasting rules
    alpha = 0.1
    v_next = v_prev - alpha * grad_f(v_prev)
    if (np.abs(v_next - v_prev) <= tol).all():
        return v_next
    else:
        return gradient_decent(v_next)

# v_0 is the initial guess
v_0 = np.array([[1], [1]])
gradient_decent(v_0)
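As a sanity check (my addition, not part of the original answer), the result can be compared with the closed-form least-squares fit from np.polyfit; with this tolerance the two should agree to within a few hundredths:

v_star = gradient_decent(v_0)
print(v_star.ravel())       # [a, b] found by gradient descent
print(np.polyfit(x, y, 1))  # least-squares [slope, intercept] = [0.4, 0.5] for these points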
I have tried to implement gradient descent myself in Python. I know there are similar topics on this, but in my attempt the guessed slope always gets really close to the real slope, while the guessed intercept never matches or even comes close to the real intercept. Does anyone know why that happens?
Also, I have read a lot of gradient descent posts and formulas; they say that on each iteration I need to multiply the gradient by the negative learning rate and repeat until it converges. As you can see in my implementation below, my gradient descent only works when I multiply the gradient by the learning rate alone, not by the negative learning rate. Why is that? Did I misunderstand gradient descent, or is my implementation wrong? (exam_m and exam_b quickly overflow if I multiply the learning rate and gradient by -1.)
intercept = -5
slope = -4
x = []
y = []
for i in range(0, 100):
    x.append(i/300)
    y.append((i * slope + intercept)/300)

learning_rate = 0.005
# y = mx + b
# m is slope, b is y-intercept
exam_m = 100
exam_b = 100

# iteration
# My error function is sum all (y - guess) ^ 2
for _ in range(20000):
    gradient_m = 0
    gradient_b = 0
    for i in range(len(x)):
        gradient_m += (y[i] - exam_m * x[i] - exam_b) * x[i]
        gradient_b += (y[i] - exam_m * x[i] - exam_b)
        # why not gradient_m -= (y[i] - exam_m * x[i] - exam_b) * x[i] like what it said in the gradient descent formula
    exam_m += learning_rate * gradient_m
    exam_b += learning_rate * gradient_b
print(exam_m, exam_b)
The reason for the overflow is the missing factor of (2/n). I have also written out the negative signs explicitly for clarity.
import numpy as np
import matplotlib.pyplot as plt

intercept = -5
slope = -4
# y = mx + b
x = []
y = []
for i in range(0, 100):
    x.append(i/300)
    y.append((i * slope + intercept)/300)

n = len(x)
x = np.array(x)
y = np.array(y)
learning_rate = 0.05
exam_m = 0
exam_b = 0
epochs = 1000

for _ in range(epochs):
    gradient_m = 0
    gradient_b = 0
    for i in range(n):
        gradient_m -= (y[i] - exam_m * x[i] - exam_b) * x[i]
        gradient_b -= (y[i] - exam_m * x[i] - exam_b)
    exam_m = exam_m - (2/n)*learning_rate * gradient_m
    exam_b = exam_b - (2/n)*learning_rate * gradient_b

print('Slope, Intercept: ', exam_m, exam_b)
y_pred = exam_m*x + exam_b
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, y_pred, '--', color='black', label='predicted_line')
plt.plot(x, y, '--', color='blue', label='original_line')
plt.legend()
plt.show()
Output:
Slope, Intercept: -2.421033215481844 -0.2795651072061604
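Note that y was generated as y[i] = (i * slope + intercept)/300 with x[i] = i/300, so the line actually underlying the data is y = -4*x - 5/300 ≈ -4*x - 0.0167; with more epochs the estimates should keep moving toward those values. A quick check of that target (my addition):

print(np.polyfit(x, y, 1))  # ≈ [-4.0, -0.0167], the slope and intercept of the generated data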
I will very briefly try to explain what I'm doing, for those who are less experienced with mathematics; it's really quite simple.
We are trying to fill a grid: we find the orange point, U(j,n+1), using the three points in the row below it, U(j-1,n), U(j,n), U(j+1,n).
The values of U in the entire bottom row are given and are periodic, so theoretically we can fill this entire grid.
The formula for calculating the orange point is:
U(j,n+1) = U(j,n) + (delta_t / (2 * delta_x)) * (U(j+1,n) - U(j-1,n))
We can write this step as a system of linear equations, i.e. as multiplication by a matrix with 1 on the diagonal, -delta_t / (2 * delta_x) on the (periodic) subdiagonal and +delta_t / (2 * delta_x) on the (periodic) superdiagonal.
And now we just repeat this process of multiplying by this matrix (iterating through the time variable) as much as we want. That's a simple way to numerically approximate a solution to a partial differential equation.
I wrote a code that does this, and then I compare my final row, to the known solution of the differential equation.
This is the code
import math
import numpy

def f(x):
    return math.cos(2 * math.pi * x)

def solution(x, t):
    return math.cos(2 * math.pi * (x + t))

# setting everything up
N = 16
Lambda = 10 ** (-20)
Delta_x = 1/(N+1)
Delta_t = Lambda * Delta_x * Delta_x
t_f = 5
v_0 = numpy.zeros((N, 1))

# Filling first row, initial condition was given
for i in range(N):
    v_0[i, 0] = f(i * Delta_x)

# Create coefficient matrix
M = numpy.zeros((N, N))
for i in range(N):
    M[i, i - 1] = -Delta_t / (2 * Delta_x)
    M[i, i] = 1
    M[i, (i + 1) % N] = Delta_t / (2 * Delta_x)

# start iterating through time
v_i = v_0
for i in range(math.floor(t_f / Delta_t) - 1):
    v_i = numpy.dot(M, v_i)
v_final = v_i
if (Delta_t * math.ceil(t_f / Delta_t) != t_f):  # we don't reach t_f exactly using Delta_t
    v_final = (1/2) * (v_i + numpy.dot(M, v_i))

u = numpy.zeros(v_final.shape)
for i in range(N):
    u[i, 0] = solution(i * Delta_x, t_f)

for x in range(v_final.shape[0]):
    print(v_final[x], u[x])
Theoretically speaking, I should be able to find a lambda small enough that v_final and the known solution u are very similar.
But I can't. No matter how small I make lambda, or how fine I make the grid, I seem to converge to something incorrect; they aren't close.
I can't for the life of me figure out the problem.
Does anyone have an idea what might be wrong?
You should have Delta_x = 1.0/N, as you divide the interval into N cells.
You get N+1 points on the grid, from u[0] to u[N], but since the boundary condition gives u[N] = u[0], you only need an array of length N to hold all the node values.
Per your given formulas you have gamma = dt/(2*dx), thus the reverse computation should be dt = gamma*2*dx, or in your variable names
Delta_t = Lambda * 2 * Delta_x
Or you are aiming at the error of the method, which is O(dt, dx²), so that it would make sense to have dt = c*dx², but not with a ridiculously small factor like c = 1e-20. If you want the time-discretization error to be small compared with the space-discretization error, c = 0.1 or c = 0.01 should be sufficient.
import numpy as np

def f(x):
    return np.cos(2 * np.pi * x)

def solution(x, t):
    return f(x + t)

# setting everything up
N_x = 16
Lambda = 1e-2
Delta_x = 1./N_x
Delta_t = Lambda * Delta_x * Delta_x
t_f = 5
N_t = int(t_f/Delta_t + 0.5); t_f = N_t * Delta_t

# Filling first row, initial condition was given
x = np.arange(0, N_x, 1) * Delta_x
v_0 = f(x)

# Create coefficient matrix
M = np.zeros((N_x, N_x))
for i in range(N_x):
    M[i, i - 1] = -Delta_t / (2 * Delta_x)
    M[i, i] = 1
    M[i, (i + 1) % N_x] = Delta_t / (2 * Delta_x)

# start iterating through time
v_i = v_0[:]
for i in range(N_t):
    v_i = np.dot(M, v_i)
v_final = v_i

u = solution(x, t_f)
for vx, ux in zip(v_final, u):
    print(vx, ux)
The Euler method is also not the most precise method; the expected error is in the range exp(L*t_f)*dx^2 = e^5/N_x^2 = 0.58 for N_x = 16, where L = 1 was taken as an approximate Lipschitz constant. If you increase to N_x = 50, this error estimate reduces to 0.06, which is also visible in the results.
The exact-in-t solution of the x-discretized problem is cos(2*pi*(x + c*t)), where c = sin(2*pi*dx)/(2*pi*dx). If you compare against that formula, the errors should be really small, of size O(dt).
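A sketch of that comparison (my addition, reusing x, v_final, Delta_x and t_f from the code above):

c = np.sin(2 * np.pi * Delta_x) / (2 * np.pi * Delta_x)  # phase speed of the x-discretized problem
u_semi = np.cos(2 * np.pi * (x + c * t_f))
print(np.max(np.abs(v_final - u_semi)))  # should be much smaller than the error against the exact PDE solution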
I am trying to code logistic regression from scratch. In the code I have, I thought my cost derivative was my regularization, but I've been tasked with adding L1-norm regularization. How do you add this in Python? Should it be added where I have defined the cost derivative? Any help in the right direction is appreciated.
def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, X):
    return Sigmoid(X @ theta)

def Cost_Function(X, Y, theta, m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X, Y, theta, m, alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = theta - Cost_Function_Derivative(X, Y, theta, m, alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print('LR Accuracy: ', my_accuracy, "%")

def Logistic_Regression(X, Y, alpha, theta, num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X, Y, theta, m, alpha)
        theta = new_theta
        if x % 100 == 0:
            print #('theta: ', theta)
            print #('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1], 1) * 2 * ep - ep
alpha = 0.5
iterations = 10000
Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)
Regularization adds a term to the cost function so that there is a compromise between minimizing the cost and keeping the model parameters small, which reduces overfitting. You can control how much of a compromise you want via a scalar factor e on the regularization term.
So just add the L1 norm of theta to the original cost function:
J = J + e * np.sum(abs(theta))
Since this term is added to the cost function, it must also be taken into account when computing the gradient of the cost function.
This is simple, because the derivative of a sum is the sum of the derivatives, so we only need to figure out the derivative of the term sum(abs(theta)). That term is piecewise linear, so its derivative is constant: it is +1 where theta >= 0 and -1 where theta < 0 (strictly speaking the derivative is undefined at 0, but we don't care about that here).
So in the function Cost_Function_Derivative we add:
J = J + alpha * e * np.where(theta >= 0, 1.0, -1.0)
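Putting both pieces together, here is a minimal sketch of regularized versions of the two functions (my own sketch, not the poster's code; e is a regularization strength you choose, the value 0.01 is just a placeholder, and Hypothesis and np come from the question's code):

e = 0.01  # hypothetical regularization strength

def Cost_Function_L1(X, Y, theta, m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1 - _y) * np.log(1 - hi))
    return J + e * np.sum(np.abs(theta))  # add the L1 penalty to the cost

def Cost_Function_Derivative_L1(X, Y, theta, m, alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    grad = 1/float(m) * X.T @ (hi - _y) + e * np.where(theta >= 0, 1.0, -1.0)  # subgradient of the L1 term
    return alpha * grad  # alpha is the learning rate, folded in as in the original code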