I have tried to implement gradient descent myself in Python. I know there are similar topics on this, but in my attempt the guessed slope always gets really close to the real slope, while the guessed intercept never matches or even comes close to the real intercept. Does anyone know why that is happening?
Also, I have read a lot of gradient descent posts and formulas, and they say that on each iteration I should multiply the gradient by the negative learning rate and repeat until it converges. As you can see in my implementation below, my gradient descent only works when I multiply the learning rate by the gradient without the sign flip. Why is that? Did I misunderstand gradient descent, or is my implementation wrong? (exam_m and exam_b quickly overflow if I multiply the learning rate and gradient by -1.)
intercept = -5
slope = -4
x = []
y = []
for i in range(0, 100):
    x.append(i/300)
    y.append((i * slope + intercept)/300)

learning_rate = 0.005

# y = mx + b
# m is slope, b is y-intercept
exam_m = 100
exam_b = 100

# iteration
# My error function is sum all (y - guess) ^ 2
for _ in range(20000):
    gradient_m = 0
    gradient_b = 0
    for i in range(len(x)):
        gradient_m += (y[i] - exam_m * x[i] - exam_b) * x[i]
        gradient_b += (y[i] - exam_m * x[i] - exam_b)
        # why not gradient_m -= (y[i] - exam_m * x[i] - exam_b) * x[i] like what it said in the gradient descent formula
    exam_m += learning_rate * gradient_m
    exam_b += learning_rate * gradient_b

print(exam_m, exam_b)
The reason for the overflow is the missing factor of (2/n): without it the accumulated gradient is a sum over all n points rather than an average, so each step is n times larger than intended. I have also written out the negative signs explicitly below for clarification.
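For reference, writing the error as the mean squared error E = (1/n) * sum((y - (m*x + b))^2) (the plain sum in the question just drops the 1/n), the chain rule already produces the minus signs:

dE/dm = -(2/n) * sum((y - (m*x + b)) * x)
dE/db = -(2/n) * sum(y - (m*x + b))

The question's inner loop accumulates sum((y - m*x - b) * x), which is the negative of dE/dm up to the 2/n factor, so adding it to exam_m already moves downhill. To use the textbook update m = m - learning_rate * dE/dm instead, the sign has to be flipped inside the loop as well, which is exactly what the code below does with gradient_m -= ... and gradient_b -= ....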
import numpy as np
import matplotlib.pyplot as plt

intercept = -5
slope = -4

# y = mx + b
x = []
y = []
for i in range(0, 100):
    x.append(i/300)
    y.append((i * slope + intercept)/300)

n = len(x)
x = np.array(x)
y = np.array(y)

learning_rate = 0.05
exam_m = 0
exam_b = 0
epochs = 1000

for _ in range(epochs):
    gradient_m = 0
    gradient_b = 0
    for i in range(n):
        gradient_m -= (y[i] - exam_m * x[i] - exam_b) * x[i]
        gradient_b -= (y[i] - exam_m * x[i] - exam_b)
    exam_m = exam_m - (2/n)*learning_rate * gradient_m
    exam_b = exam_b - (2/n)*learning_rate * gradient_b

print('Slope, Intercept: ', exam_m, exam_b)

y_pred = exam_m*x + exam_b
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, y_pred, '--', color='black', label='predicted_line')
plt.plot(x, y, '--', color='blue', label='original_line')
plt.legend()
plt.show()
Output:
Slope, Intercept: -2.421033215481844 -0.2795651072061604
This is how I generated the training data for my Linear Regression.
!pip install grapher numpy
from grapher import Grapher
import matplotlib.pyplot as plt
import numpy as np
# Secret: y = 3x + 4
# x, y = [float(row[0]) for row in rows], [float(row[5]) for row in rows]
x, y = [a for a in range(-20, 20)], [3*a + 4 for a in range(-20, 20)]
g = Grapher(['3*x + 4'], title="y = 3x+4")
plt.scatter(x, y)
g.plot()
Then, I tried gradient descent on a simple quadratic function (x - 7)^2
def n(x):
    return (x-7)**2

cur_x = 0
lr = 0.001
ittr = 10000
n = 0
prev_x = -1
max_precision = 0.0000001
precision = 1

while n < ittr and precision > max_precision:
    prev_x = cur_x
    cur_x = cur_x - lr * (2*(cur_x - 7))
    precision = abs(prev_x - cur_x)
    n += 1
    if n % 100 == 0:
        print(n, ':')
        print(cur_x)
        print()

print(cur_x)
And this works perfectly.
Then I made a Linear Regression class to make the same thing happen.
class LinearRegression:
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.m = 1
        self.c = 0
        self.learning_rate = 0.01
        self.max_precision = 0.000001
        self.itter = 10000

    def h(self, x, m, c):
        return m * x + c

    def J(self, m, c):
        loss = 0
        for x in self.X:
            loss += (self.h(x, m, c) - self.Y[self.X.index(x)])**2
        return loss/2

    def calc_loss(self):
        return self.J(self.m, self.c)

    def guess_answer(self, step=1):
        losses = []
        mcvalues = []
        for m in np.arange(-10, 10, step):
            for c in np.arange(-10, 10, step):
                mcvalues.append((m, c))
                losses.append(self.J(m, c))
        minloss = sorted(losses)[0]
        return mcvalues[losses.index(minloss)]

    def gradient_decent(self):
        print('Orignal: ', self.m, self.c)
        nm = 0
        nc = 0
        prev_m = 0
        perv_c = -1
        mprecision = 1
        cprecision = 1
        while nm < self.itter and mprecision > self.max_precision:
            prev_m = self.m
            nm += 1
            self.m = self.m - self.learning_rate * sum([(self.h(x, self.m, self.c) - self.Y[self.X.index(x)])*x for x in self.X])
            mprecision = abs(self.m - prev_m)
        return self.m, self.c

    def graph_loss(self):
        plt.scatter(0, self.J(0))
        print(self.J(0))
        plt.plot(self.X, [self.J(x) for x in self.X])

    def check_loss(self):
        plt.plot([m for m in range(-20, 20)], [self.J(m, 0) for m in range(-20, 20)])
        x1 = 10
        y1 = self.J(x1, 0)
        l = sum([(self.h(x, x1, self.c) - self.Y[self.X.index(x)])*x for x in self.X])
        print(l)
        plt.plot([m for m in range(-20, 20)], [(l*(m - x1)) + y1 for m in range(-20, 20)])
        plt.scatter([x1], [y1])
LinearRegression(x, y).gradient_decent()
Output is
Orignal: 1 0
(nan, 0)
Then I tried graphing my loss function J(m, c) and used its derivative to check whether it actually gives the slope. I suspected that I had messed up my d(J(m, c))/dm.
After running LinearRegression(x, y).check_loss()
I get this graph
The derivative gives the correct tangent slope at whatever point I choose. Why isn't it working in my code?
Now that I look at it, the main problem is the learning rate. A learning rate of 0.01 is too high. Keeping it lower than 0.00035 works well; about 0.0002 works well and converges quickly. I tried graphing things and saw that it makes a lot of difference.
With a learning rate of 0.00035 and 1000 iterations, this was the graph:
With a learning rate of 0.0002 and 1000 iterations, this was the graph:
With a learning rate of 0.0004 and just 10 iterations, this was the graph:
Instead of converging to the point, it diverges. That is why the learning rate is so important, and anything bigger than 0.0004 will give the same result.
It took me quite some time to figure out.
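For anyone who wants to reproduce this, here is a minimal sketch (assuming the LinearRegression class and the x, y lists defined above) that tries a few learning rates:

# Sketch: rerun the same gradient descent with several learning rates.
# Rates around 0.0002 converge; 0.0004 and larger diverge to nan, as described above.
for lr in (0.0002, 0.00035, 0.0004, 0.01):
    model = LinearRegression(x, y)
    model.learning_rate = lr
    print(lr, '->', model.gradient_decent())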
I am trying to write a program to calculate the slope and the intercept of a linear regression model, but when I run more than 10 iterations, the gradient descent function gives np.nan for both the intercept and the slope.
Below is my implementation
def get_gradient_at_b(x, y, b, m):
    N = len(x)
    diff = 0
    for i in range(N):
        x_val = x[i]
        y_val = y[i]
        diff += (y_val - ((m * x_val) + b))
    b_gradient = -(2/N) * diff
    return b_gradient

def get_gradient_at_m(x, y, b, m):
    N = len(x)
    diff = 0
    for i in range(N):
        x_val = x[i]
        y_val = y[i]
        diff += x_val * (y_val - ((m * x_val) + b))
    m_gradient = -(2/N) * diff
    return m_gradient

def step_gradient(b_current, m_current, x, y, learning_rate):
    b_gradient = get_gradient_at_b(x, y, b_current, m_current)
    m_gradient = get_gradient_at_m(x, y, b_current, m_current)
    b = b_current - (learning_rate * b_gradient)
    m = m_current - (learning_rate * m_gradient)
    return [b, m]

def gradient_descent(x, y, learning_rate, num_iterations):
    b = 0
    m = 0
    for i in range(num_iterations):
        b, m = step_gradient(b, m, x, y, learning_rate)
    return [b, m]
I am running it on the following data:
a=[3.87656018e+11, 4.10320300e+11, 4.15730874e+11, 4.52699998e+11,
4.62146799e+11, 4.78965491e+11, 5.08068952e+11, 5.99592902e+11,
6.99688853e+11, 8.08901077e+11, 9.20316530e+11, 1.20111177e+12,
1.18695276e+12, 1.32394030e+12, 1.65661707e+12, 1.82304993e+12,
1.82763786e+12, 1.85672212e+12, 2.03912745e+12, 2.10239081e+12,
2.27422971e+12, 2.60081824e+12]
b=[3.3469950e+10, 3.4784980e+10, 3.3218720e+10, 3.6822490e+10,
4.4560290e+10, 4.3826720e+10, 5.2719430e+10, 6.3842550e+10,
8.3535940e+10, 1.0309053e+11, 1.2641405e+11, 1.6313218e+11,
1.8529536e+11, 1.7875143e+11, 2.4981555e+11, 3.0596392e+11,
3.0040058e+11, 3.1440530e+11, 3.1033848e+11, 2.6229109e+11,
2.7585243e+11, 3.0352616e+11]
print(gradient_descent(a, b, 0.01, 100))
#result --> [nan, nan]
When I run the gradient_descent function on a dataset with smaller values, it gives the correct answers. I was also able to obtain the intercept and slope for the above data using sklearn.linear_model.LinearRegression.
Any help will be appreciated in figuring out why the result is [nan, nan] instead of giving me the correct intercept and slope.
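For reference, the sklearn comparison mentioned above looks roughly like this (a sketch; sklearn expects the feature array to be two-dimensional):

import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(np.array(a).reshape(-1, 1), b)
print(reg.intercept_, reg.coef_[0])   # intercept and slope to compare against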
You need to reduce the learning rate. Since the values in a and b are so large (>= 1e11), the gradients they produce are huge as well, so the learning rate needs to be approximately 1e-25 for the descent to take reasonable steps at all; anything larger makes it overshoot wildly.
b, m = gradient_descent(a, b, 5e-25, 100)
print(b, m)
Out: -3.7387067636195266e-13 0.13854551291084335
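To see why such a tiny rate is needed, you can inspect the very first gradients the code computes (a sketch reusing the functions above; a and b here are the original data lists from the question, before the snippet above rebinds b):

# With b = m = 0, the first gradients are driven directly by the raw data scale:
# x is ~1e12 and y is ~1e11, so the slope gradient comes out on the order of 1e23.
# A learning rate of 0.01 would take a first step of roughly 1e21 and quickly blow up
# to nan, while something like 5e-25 keeps the steps at a sensible size.
first_b_grad = get_gradient_at_b(a, b, 0, 0)
first_m_grad = get_gradient_at_m(a, b, 0, 0)
print(first_b_grad, first_m_grad)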
I'm looking to use multivariate regression with least squares as my cost function to find the a, b, c in ax^2 + bx + c that best fits cos(x) on (-2, 2). My cost won't decrease but stays ridiculously high. What am I doing wrong?
x = np.linspace(-2, 2, 100)
y = np.cos(x)
theta = np.random.random((3, 1))
m = len(y)

for i in range(10000):
    # Calculate my y_hat
    y_hat = np.array([(theta[0]*(a**2) + theta[1]*a + theta[2]) for a in x])
    # Calculate my cost based off y_hat and y
    cost = np.sum((y_hat - y) ** 2) * (1/m)
    # Calculate my derivatives based off y_hat and x
    da = (2 / m) * np.sum((y_hat - y) * (x**2))
    db = (2 / m) * np.sum((y_hat - y) * (x))
    dc = (2 / m) * np.sum((y_hat - y))
    # update step
    theta[0] = theta[0] - 0.0001*(da)
    theta[1] = theta[1] - 0.0001*(db)
    theta[2] = theta[2] - 0.0001*(dc)
    print("Epoch Num: {} Cost: {}".format(i, cost))

print(theta)
Your calculation of y_hat is slightly incorrect. It's currently a 2D array of shape (100, 1), so y_hat - y broadcasts to a (100, 100) matrix and the cost and gradients are computed from the wrong quantities.
This should help. It pulls the zeroth element from each of the rows:
theta_ = [(theta[0]*(a**2) + theta[1]*a + theta[2]) for a in x]
y_hat = np.array([t[0] for t in theta_])
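An equivalent fix (a sketch, not the only option) is to flatten theta and let NumPy broadcast over x directly, which keeps y_hat one-dimensional:

# theta.ravel() turns the (3, 1) column into three plain scalars,
# so y_hat comes out with shape (100,), the same shape as y.
t0, t1, t2 = theta.ravel()
y_hat = t0 * x**2 + t1 * x + t2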
I am trying to code logistic regression from scratch. In the code I have, I thought my cost derivative was my regularization, but I've been tasked with adding L1-norm regularization. How do you add this in Python? Should it be added where I have defined the cost derivative? Any help in the right direction is appreciated.
def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, X):
    return Sigmoid(X @ theta)

def Cost_Function(X, Y, theta, m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X, Y, theta, m, alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = theta - Cost_Function_Derivative(X, Y, theta, m, alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print('LR Accuracy: ', my_accuracy, "%")

def Logistic_Regression(X, Y, alpha, theta, num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X, Y, theta, m, alpha)
        theta = new_theta
        if x % 100 == 0:
            print #('theta: ', theta)
            print #('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1], 1) * 2 * ep - ep
alpha = 0.5
iterations = 10000
Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)
Regularization adds a term to the cost function so that there is a compromise between minimizing the cost and keeping the model parameters small, which reduces overfitting. You control how much compromise you want with a scalar e multiplying the regularization term.
So just add the L1 norm of theta to the original cost function:
J = J + e * np.sum(abs(theta))
Since this term is added to the cost function, it must also be taken into account when computing the gradient of the cost function.
This is simple, because the derivative of a sum is the sum of the derivatives. So we just need to figure out the derivative of the term sum(abs(theta)). abs(theta) is piecewise linear, so its derivative is constant on each side of zero: it is 1 where theta >= 0 and -1 where theta < 0 (there is a mathematical indeterminacy at exactly 0, but we don't care about it).
So in the function Cost_Function_Derivative we add:
J = J + alpha * e * np.where(theta >= 0, 1.0, -1.0)
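Putting both pieces together, the two modified functions might look like this (a sketch of the changes above, with e as the regularization strength you choose; Gradient_Descent and Logistic_Regression then need to pass e through as an extra argument):

def Cost_Function(X, Y, theta, m, e):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1 - _y) * np.log(1 - hi))
    return J + e * np.sum(abs(theta))                        # add the L1 penalty

def Cost_Function_Derivative(X, Y, theta, m, alpha, e):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J + alpha * e * np.where(theta >= 0, 1.0, -1.0)   # subgradient of the L1 term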
I'm trying to implement a very naive gradient descent in python. However, it looks like it goes into an infinite loop. Could you please help me debug it?
y = lambda x: x**2
dy_dx = lambda x: 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while derivative(optimum) != 0:
        optimum = optimum - derivative(optimum)
    else:
        return optimum

gradient_descent(y, dy_dx, 5)
Edit:
Now I have this code, and I really can't make sense of the output. (P.S.: it might freeze your CPU.)
y = lambda x: x**2
dy_dx = lambda x: 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while abs(derivative(optimum)) > 0.01:
        optimum = optimum - 2*derivative(optimum)
        print((optimum, derivative(optimum)))
    else:
        return optimum

gradient_descent(y, dy_dx, 5)
Now I'm trying to apply it to a regression problem; however, the output doesn't appear to be correct, as shown below:
Output of gradient descent code below
import matplotlib.pyplot as plt

def stepGradient(x, y, step):
    b_current = 0
    m_current = 0
    b_gradient = 0
    m_gradient = 0
    N = int(len(x))
    for i in range(0, N):
        b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
        m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    while abs(b_gradient) > 0.01 and abs(m_gradient) > 0.01:
        b_current = b_current - (step * b_gradient)
        m_current = m_current - (step * m_gradient)
        for i in range(0, N):
            b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
            m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    return [b_current, m_current]

x = [1, 2, 2, 3, 4, 5, 7, 8]
y = [1.5, 3, 1, 3, 2, 5, 6, 7]
step = 0.00001

(b, m) = stepGradient(x, y, step)
plt.scatter(x, y)
abline_values = [m * i + b for i in x]
plt.plot(x, abline_values, 'b')
plt.show()
Fixed :D
import matplotlib.pyplot as plt

def stepGradient(x, y):
    step = 0.001
    b_current = 0
    m_current = 0
    b_gradient = 0
    m_gradient = 0
    N = int(len(x))
    for i in range(0, N):
        b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
        m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    while abs(b_gradient) > 0.01 or abs(m_gradient) > 0.01:
        b_current = b_current - (step * b_gradient)
        m_current = m_current - (step * m_gradient)
        b_gradient = 0
        m_gradient = 0
        for i in range(0, N):
            b_gradient += -(1/N) * (y[i] - ((m_current*x[i]) + b_current))
            m_gradient += -(1/N) * x[i] * (y[i] - ((m_current * x[i]) + b_current))
    return [b_current, m_current]

x = [1, 2, 2, 3, 4, 5, 7, 8, 10]
y = [1.5, 3, 1, 3, 2, 5, 6, 7, 20]

(b, m) = stepGradient(x, y)
plt.scatter(x, y)
abline_values = [m * i + b for i in x]
plt.plot(x, abline_values, 'b')
plt.show()
Your while loop stops only when a calculated floating-point value equals zero exactly. That is naïve, since floating-point values rarely hit an exact target. Instead, stop the loop when the calculated value is close enough to zero. Use something like
while abs(derivative(optimum)) > eps:
where eps is the desired precision of the calculated value. This could be made another parameter, perhaps with a default value of 1e-10 or so.
That said, the problem in your case is worse. Your algorithm is far too naïve in assuming that the update
optimum = optimum - derivative(optimum)
will move the value of optimum closer to the actual optimum. In your particular case, the variable optimum just cycles back and forth between 5 (your initial guess) and -5: the derivative at 5 is 10 and the derivative at -5 is -10, so each step lands on the mirror image of the previous point. (With the extra factor of 2 in your edited code, the steps grow instead of cycling, which is why that version never terminates either.)
So you need to avoid such cycling. You could multiply your delta derivative(optimum) by something smaller than 1, which would work in your particular case y = x**2, but this will not work in general.
To be completely safe, 'bracket' your optimum point with a smaller value and a larger value, and use the derivative to find the next guess. But ensure that your next guess does not go outside the bracketed interval. If it does, or if the convergence of your guesses is too slow, use another method such as bisection or golden mean search.
Of course, this means your 'very naïve gradient descent' algorithm is too naïve to work in general. That's why real optimization routines are more complicated.
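As an illustration of the bracketing idea (a sketch using the dy_dx defined above, not a drop-in replacement): bisection narrows down the point where the derivative changes sign and never steps outside the bracket:

def bisect_derivative(derivative, lo, hi, eps=1e-10):
    # Assumes derivative(lo) and derivative(hi) have opposite signs,
    # so a stationary point lies somewhere inside [lo, hi].
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if derivative(lo) * derivative(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

print(bisect_derivative(dy_dx, -5, 5))   # ~0.0, the minimum of y = x**2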
You also need to decrease your step size (gamma in the gradient descent formula):
y = lambda x: x**2
dy_dx = lambda x: 2*x

def gradient_descent(function, derivative, initial_guess):
    optimum = initial_guess
    while abs(derivative(optimum)) > 0.01:
        optimum = optimum - 0.01*derivative(optimum)
        print((optimum, derivative(optimum)))
    else:
        return optimum