Implementation of gradient descent blowing up to infinity?

Implementation of gradient descent blowing up to infinity? - python

This is how I generated the training data for my Linear Regression.
!pip install grapher, numpy
from grapher import Grapher
import matplotlib.pyplot as plt
import numpy as np
# Secret: y = 3x + 4
# x, y = [float(row[0]) for row in rows], [float(row[5]) for row in rows]
x, y = [a for a in range(-20, 20)], [3*a + 4 for a in range(-20, 20)]
g = Grapher(['3*x + 4'], title="y = 3x+4")
plt.scatter(x, y)
g.plot()
Then, I tried gradient descent on a simple quadratic function (x - 7)^2
def n(x):
return (x-7)**2
cur_x = 0
lr = 0.001
ittr = 10000
n = 0
prev_x = -1
max_precision = 0.0000001
precision = 1
while n < ittr and precision > max_precision:
prev_x = cur_x
cur_x = cur_x - lr * (2*(cur_x - 7))
precision = abs(prev_x - cur_x)
n+=1
if n%100 == 0:
print(n, ':')
print(cur_x)
print()
print(cur_x)
And this works perfectly.
Then I made a Linear Regression class to make the same thing happen.
class LinearRegression:
def __init__(self, X, Y):
self.X = X
self.Y = Y
self.m = 1
self.c = 0
self.learning_rate = 0.01
self.max_precision = 0.000001
self.itter = 10000
def h(self, x, m, c):
return m * x + c
def J(self, m, c):
loss = 0
for x in self.X:
loss += (self.h(x, m, c) - self.Y[self.X.index(x)])**2
return loss/2
def calc_loss(self):
return self.J(self.m, self.c)
def guess_answer(self, step=1):
losses = []
mcvalues = []
for m in np.arange(-10, 10, step):
for c in np.arange(-10, 10, step):
mcvalues.append((m, c))
losses.append(self.J(m, c))
minloss = sorted(losses)[0]
return mcvalues[losses.index(minloss)]
def gradient_decent(self):
print('Orignal: ', self.m, self.c)
nm = 0
nc = 0
prev_m = 0
perv_c = -1
mprecision = 1
cprecision = 1
while nm < self.itter and mprecision > self.max_precision:
prev_m = self.m
nm += 1
self.m = self.m - self.learning_rate * sum([(self.h(x, self.m, self.c) - self.Y[self.X.index(x)])*x for x in self.X])
mprecision = abs(self.m - prev_m)
return self.m, self.c
def graph_loss(self):
plt.scatter(0, self.J(0))
print(self.J(0))
plt.plot(self.X, [self.J(x) for x in self.X])
def check_loss(self):
plt.plot([m for m in range(-20, 20)], [self.J(m, 0) for m in range(-20, 20)])
x1 = 10
y1 = self.J(x1, 0)
l = sum([(self.h(x, x1, self.c) - self.Y[self.X.index(x)])*x for x in self.X])
print(l)
plt.plot([m for m in range(-20, 20)], [(l*(m - x1)) + y1 for m in range(-20, 20)])
plt.scatter([x1], [y1])
LinearRegression(x, y).gradient_decent()
Output is
Orignal: 1 0
(nan, 0)
Then I tried graphing my Loss Function (J(m, c)) and tried to use its derivative to see if it actuallly gives slope. I was in a suspection that I have messed up my d(J(m, c))/dm
After running LinearRegression(x, y).check_loss()
I get this graph
It is a slope at whatever point I want it to be. Why isnt it working in my code?

Now that I see, the main problem is with the learning rate. Learning rate of 0.01 is too high. Keeping it lower than 0.00035 works well. About 0.0002 works well and quick. I tried graphing things, and saw it made a lot of difference.
With a learning rate of 0.00035 and 1000 iterations, this was the graph:
With a learning rate of 0.0002 and 1000 iterations, this was the graph:
With a learning rate of 0.0004 and just 10 iterations, this was the graph:
Instead of converging to the point, its diverging. THat is why learning rate is important and anything bigger than 0.0004 will result in the same.
It took me quite some time to figure out.

Related

Why I'm getting a huge cost in Stochastic Gradient Descent Implementation?

I've run into some problems while trying to implement Stochastic Gradient Descent, and basically what is happening is that my cost is growing like crazy and I don't have a clue why.
MSE implementation:
def mse(x,y,w,b):
predictions = x # w
summed = (np.square(y - predictions - b)).mean(0)
cost = summed / 2
return cost
Gradients:
def grad_w(y,x,w,b,n_samples):
return -y # x / n_samples + x.T # x # w / n_samples + b * x.mean(0)
def grad_b(y,x,w,b,n_samples):
return -y.mean(0) + x.mean(0) # w + b
SGD Implementation:
def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
length = len(y)
cost_history = np.zeros(iterations)
n_batches = int(length/batch_size)
for it in range(iterations):
cost =0
indices = np.random.permutation(length)
X = X[indices]
y = y[indices]
for i in range(0,length,batch_size):
X_i = X[i:i+batch_size]
y_i = y[i:i+batch_size]
w -= learning_rate*grad_w(y_i,X_i,w,b,length)
b -= learning_rate*grad_b(y_i,X_i,w,b,length)
cost = mse(X_i,y_i,w,b)
cost_history[it] = cost
if cost_history[it] <= 0.0052: break
return w, cost_history[:it]
Random Variables:
w_true = np.array([0.2, 0.5,-0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1)
y = x # w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
After running this:
theta,cost_history = stochastic_gradient_descent(x,y,w,b)
print('Final cost/MSE: {:0.3f}'.format(cost_history[-1]))
I Get that:
Final cost/MSE: 3005958172614261248.000
And here is the plot

Here are a few suggestions:
your learning rate is too big for the training: changing it to something like 1e-3 should be fine.
your update part could be slightly modified as follows:
def stochastic_gradient_descent(X,y,w,b,learning_rate=0.01,iterations=500,batch_size =100):
length = len(y)
cost_history = np.zeros(iterations)
n_batches = int(length/batch_size)
for it in range(iterations):
cost =0
indices = np.random.permutation(length)
X = X[indices]
y = y[indices]
for i in range(0,length,batch_size):
X_i = X[i:i+batch_size]
y_i = y[i:i+batch_size]
w -= learning_rate*grad_w(y_i,X_i,w,b,len(X_i)) # the denominator should be the actual batch size
b -= learning_rate*grad_b(y_i,X_i,w,b,len(X_i))
cost += mse(X_i,y_i,w,b)*len(X_i) # add batch loss
cost_history[it] = cost/length # this is a running average of your batch losses, which is statistically more stable
if cost_history[it] <= 0.0052: break
return w, b, cost_history[:it]
The final results:
w_true = np.array([0.2, 0.5, -0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1)
y = x # w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 0.0
theta,bias,cost_history = stochastic_gradient_descent(x,y,w,b,learning_rate=1e-3,iterations=3000)
print("Final epoch cost/MSE: {:0.3f}".format(cost_history[-1]))
print("True final cost/MSE: {:0.3f}".format(mse(x,y,theta,bias)))
print(f"Final coefficients:\n{theta,bias}")

Hey #TQCH and thanks for that. I've come up with a different approach to implement SGD without an inner loop and the results were also pretty sweet.
def stochastic_gradient_descent(X,y,w,b,learning_rate=0.35,iterations=3000,batch_size =100):
length = len(y)
cost_history = np.zeros(iterations)
n_batches = int(length/batch_size)
marker = 0
cost = mse(X,y,w,b)
print(cost)
for it in range(iterations):
cost =0
indices = np.random.choice(length, batch_size)
X_i = X[indices]
y_i = y[indices]
w -= learning_rate*grad_w(y_i,X_i,w,b)
b -= learning_rate*grad_b(y_i,X_i,w,b)
cost = mse(X_i,y_i,w,b)
cost_history[it] = cost
if cost_history[it] <= 0.0075 and cost_history[it] > 0.0071: marker = it
if cost <= 0.0052: break
print(f'{w}, {b}')
return w, cost_history, marker, cost
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
theta,cost_history, marker, cost = stochastic_gradient_descent(x,y,w,b)
print(f'Number of iterations: {marker}')
print('Final cost/MSE: {:0.3f}'.format(cost))
which gave me these results:
1.9443112664859845,
[ 0.19592532 0.31735225 -0.20044424], -0.9059800816290591
Number of iterations: 68
Final cost/MSE: 0.005
But you're right I missed that I was dividing by total length of vector y and not by batch size and forgot to add batch loss!
Thanks for that!

My vectorization implementation of gradient descent does not get me the right answer

I'm currently working on Andrew Ng's gradient descent exercise using python but keeps getting me the wrong optimal theta. I followed this vectorization cheatsheet for gradient descent --- https://medium.com/ml-ai-study-group/vectorized-implementation-of-cost-functions-and-gradient-vectors-linear-regression-and-logistic-31c17bca9181.
Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def cost_func(X, Y, theta):
m = len(X)
H = X.dot(theta)
J = 1/(2*m) * (H - Y).T.dot(H - Y)
return J
def gradient_descent(X, Y, alpha=0.01, iterations=1500):
#initializing theta as a zero vector
theta = np.zeros(X.shape[1])
#initializing the a list of cost function value
J_list = [cost_func(X, Y, theta)]
m = len(X)
while iterations > 0:
H = X.dot(theta)
delta = (1/m)*X.T.dot(H - Y)
theta = theta - alpha * delta
iterations -= 1
J_list.append(cost_func(X, Y, theta))
return theta, J_list
def check_convergence(J_list):
plt.plot(range(len(J_list)), J_list)
plt.xlabel('Iterations')
plt.ylabel('Cost J')
plt.show()
file_name_1 = 'https://raw.githubusercontent.com/kaleko/CourseraML/master/ex1/data/ex1data1.txt'
df1 = pd.read_csv(file_name_1, header=None)
X = df1.values[:, 0]
Y = df1.values[:, 1]
m = len(X)
X = np.column_stack((np.ones(m), X))
theta_optimal, J_list = gradient_descent(X, Y, 0.01, 1500)
print(theta_optimal)
check_convergence(J_list)
My theta output is [-3.63029144 1.16636235], which is incorrect.
Here is my cost function graph. As you see, it converges way too quickly.
The correct graph should look like.
Thank you.

Gradient Descent is not converging for very large values in a small dataset

I am trying to write a program to calculate the slope and the intercept of a linear regression model but when I am running more than 10 iterations, the gradient descent function gives the np.nan value for both intercept as well as slope.
Below is my implementation
def get_gradient_at_b(x, y, b, m):
N = len(x)
diff = 0
for i in range(N):
x_val = x[i]
y_val = y[i]
diff += (y_val - ((m * x_val) + b))
b_gradient = -(2/N) * diff
return b_gradient
def get_gradient_at_m(x, y, b, m):
N = len(x)
diff = 0
for i in range(N):
x_val = x[i]
y_val = y[i]
diff += x_val * (y_val - ((m * x_val) + b))
m_gradient = -(2/N) * diff
return m_gradient
def step_gradient(b_current, m_current, x, y, learning_rate):
b_gradient = get_gradient_at_b(x, y, b_current, m_current)
m_gradient = get_gradient_at_m(x, y, b_current, m_current)
b = b_current - (learning_rate * b_gradient)
m = m_current - (learning_rate * m_gradient)
return [b, m]
def gradient_descent(x, y, learning_rate, num_iterations):
b = 0
m = 0
for i in range(num_iterations):
b, m = step_gradient(b, m, x, y, learning_rate)
return [b,m]
I am running it on the following data:
a=[3.87656018e+11, 4.10320300e+11, 4.15730874e+11, 4.52699998e+11,
4.62146799e+11, 4.78965491e+11, 5.08068952e+11, 5.99592902e+11,
6.99688853e+11, 8.08901077e+11, 9.20316530e+11, 1.20111177e+12,
1.18695276e+12, 1.32394030e+12, 1.65661707e+12, 1.82304993e+12,
1.82763786e+12, 1.85672212e+12, 2.03912745e+12, 2.10239081e+12,
2.27422971e+12, 2.60081824e+12]
b=[3.3469950e+10, 3.4784980e+10, 3.3218720e+10, 3.6822490e+10,
4.4560290e+10, 4.3826720e+10, 5.2719430e+10, 6.3842550e+10,
8.3535940e+10, 1.0309053e+11, 1.2641405e+11, 1.6313218e+11,
1.8529536e+11, 1.7875143e+11, 2.4981555e+11, 3.0596392e+11,
3.0040058e+11, 3.1440530e+11, 3.1033848e+11, 2.6229109e+11,
2.7585243e+11, 3.0352616e+11]
print(gradient_descent(a, b, 0.01, 100))
#result --> [nan, nan]
When I run the gradient_descent function on a dataset with smaller values, it gives the correct answers. Also I was able to obtain the intercept and slope for the above data with from sklearn.linear_model import LinearRegression
Any help will be appreciated in figuring out why the result is [nan, nan] instead of giving me the correct intercept and slope.

You need to reduce the learning rate. Since the values in a and b are so large (>= 1e11), the learning rate needs be approximately 1e-25 for this to even do the gradient descent, else it will randomly overshoot because of large gradients of a and b.
b, m = gradient_descent(a, b, 5e-25, 100)
print(b, m)
Out: -3.7387067636195266e-13 0.13854551291084335

How to add L1 normalization in python?

I am trying to code logistic regression from scratch. In this code I have, I thought my cost derivative was my regularization, but I've been tasked with adding L1norm regularization. How do you add this in python? Should this be added where I have defined the cost derivative? Any help in the right direction is appreciated.
def Sigmoid(z):
return 1/(1 + np.exp(-z))
def Hypothesis(theta, X):
return Sigmoid(X # theta)
def Cost_Function(X,Y,theta,m):
hi = Hypothesis(theta, X)
_y = Y.reshape(-1, 1)
J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
return J
def Cost_Function_Derivative(X,Y,theta,m,alpha):
hi = Hypothesis(theta,X)
_y = Y.reshape(-1, 1)
J = alpha/float(m) * X.T # (hi - _y)
return J
def Gradient_Descent(X,Y,theta,m,alpha):
new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
return new_theta
def Accuracy(theta):
correct = 0
length = len(X_test)
prediction = (Hypothesis(theta, X_test) > 0.5)
_y = Y_test.reshape(-1, 1)
correct = prediction == _y
my_accuracy = (np.sum(correct) / length)*100
print ('LR Accuracy: ', my_accuracy, "%")
def Logistic_Regression(X,Y,alpha,theta,num_iters):
m = len(Y)
for x in range(num_iters):
new_theta = Gradient_Descent(X,Y,theta,m,alpha)
theta = new_theta
if x % 100 == 0:
print #('theta: ', theta)
print #('cost: ', Cost_Function(X,Y,theta,m))
Accuracy(theta)
ep = .012
initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 10000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)

Regularization adds a term to the cost function so that there is a compromise between minimize cost and minimizing the model parameters to reduce overfitting. You can control how much compromise you would like by adding a scalar e for the regularization term.
So just add the L1 norm of theta to the original cost function:
J = J + e * np.sum(abs(theta))
Since this term is added to the cost function, then it should be considered when computing the gradient of the cost function.
This is simple since the derivative of the sum is the sum of derivatives. So now just need to figure out what is the derivate of the term sum(abs(theta)). Since it is a linear term, then the derivative is constant. It is = 1 if theta >= 0, and -1 if theta < 0 (note there is a mathematical undeterminity at 0, but we don't care about it).
So in the function Cost_Function_Derivative we add:
J = J + alpha * e * (theta >= 0).astype(float)

Replacing multiprocessing pool.map with mpi4py

I'm a beginner in using MPI, and I'm still going through the documentation. However, there's very little to work on when it comes to mpi4py. I have written a code that currently uses the multiprocessing module to run on many cores, but I need replace this with mpi4py so that I can use more than one node to run my code. My code is below, when using the multiprocessing module, and also without.
With multiprocessing,
import numpy as np
import multiprocessing
start_time = time.time()
E = 0.1
M = 5
n = 1000
G = 1
c = 1
stretch = [10, 1]
#Point-Distribution Generator Function
def CDF_inv(x, e, m):
A = 1/(1 + np.log(m/e))
if x == 1:
return m
elif 0 <= x <= A:
return e * x / A
elif A < x < 1:
return e * np.exp((x / A) - 1)
#Elliptical point distribution Generator Function
def get_coor_ellip(dist=CDF_inv, params=[E, M], stretch=stretch):
R = dist(random.random(), *params)
theta = random.random() * 2 * np.pi
return (R * np.cos(theta) * stretch[0], R * np.sin(theta) * stretch[1])
def get_dist_sq(x_array, y_array):
return x_array**2 + y_array**2
#Function to obtain alpha
def get_alpha(args):
zeta_list_part, M_list_part, X, Y = args
alpha_x = 0
alpha_y = 0
for key in range(len(M_list_part)):
z_m_z_x = X - zeta_list_part[key][0]
z_m_z_y = Y - zeta_list_part[key][1]
dist_z_m_z = get_dist_sq(z_m_z_x, z_m_z_y)
alpha_x += M_list_part[key] * z_m_z_x / dist_z_m_z
alpha_y += M_list_part[key] * z_m_z_y / dist_z_m_z
return (alpha_x, alpha_y)
#The part of the process containing the loop that needs to be parallelised, where I use pool.map()
if __name__ == '__main__':
# n processes, scale accordingly
num_processes = 10
pool = multiprocessing.Pool(processes=num_processes)
random_sample = [CDF_inv(x, E, M)
for x in [random.random() for e in range(n)]]
zeta_list = [get_coor_ellip() for e in range(n)]
x1, y1 = zip(*zeta_list)
zeta_list = np.column_stack((np.array(x1), np.array(y1)))
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
print len(x)*len(y)*n,'calculations to be carried out.'
M_list = np.array([.001 for i in range(n)])
# split zeta_list, M_list, X, and Y
zeta_list_split = np.array_split(zeta_list, num_processes, axis=0)
M_list_split = np.array_split(M_list, num_processes)
X_list = [X for e in range(num_processes)]
Y_list = [Y for e in range(num_processes)]
alpha_list = pool.map(
get_alpha, zip(zeta_list_split, M_list_split, X_list, Y_list))
alpha_x = 0
alpha_y = 0
for e in alpha_list:
alpha_x += e[0] * 4 * G / (c**2)
alpha_y += e[1] * 4 * G / (c**2)
print("%f seconds" % (time.time() - start_time))
Without multiprocessing,
import numpy as np
E = 0.1
M = 5
G = 1
c = 1
M_list = [.1 for i in range(n)]
#Point-Distribution Generator Function
def CDF_inv(x, e, m):
A = 1/(1 + np.log(m/e))
if x == 1:
return m
elif 0 <= x <= A:
return e * x / A
elif A < x < 1:
return e * np.exp((x / A) - 1)
n = 1000
random_sample = [CDF_inv(x, E, M)
for x in [random.random() for e in range(n)]]
stretch = [5, 2]
#Elliptical point distribution Generator Function
def get_coor_ellip(dist=CDF_inv, params=[E, M], stretch=stretch):
R = dist(random.random(), *params)
theta = random.random() * 2 * np.pi
return (R * np.cos(theta) * stretch[0], R * np.sin(theta) * stretch[1])
#zeta_list is the list of coordinates of a distribution of points
zeta_list = [get_coor_ellip() for e in range(n)]
x1, y1 = zip(*zeta_list)
zeta_list = np.column_stack((np.array(x1), np.array(y1)))
#Creation of a X-Y Grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
def get_dist_sq(x_array, y_array):
return x_array**2 + y_array**2
#Calculation of alpha, containing the loop that needs to be parallelised.
alpha_x = 0
alpha_y = 0
for key in range(len(M_list)):
z_m_z_x = X - zeta_list[key][0]
z_m_z_y = Y - zeta_list[key][1]
dist_z_m_z = get_dist_sq(z_m_z_x, z_m_z_y)
alpha_x += M_list[key] * z_m_z_x / dist_z_m_z
alpha_y += M_list[key] * z_m_z_y / dist_z_m_z
alpha_x *= 4 * G / (c**2)
alpha_y *= 4 * G / (c**2)
Basically what my code does is, it first generates a list of points that follow a certain distribution. Then I apply an equation to obtain the quantity 'alpha' using different relations between the distances of the points. The part that requires parallelisation is the single for loop involved in the calculation of alpha. What I want to do is to use mpi4py instead of multiprocessing to do this, and I am not sure how to get this going.

Transforming the multiprocessing.map version to MPI can be done using scatter / gather. In your case it is useful, that you already prepare the input list into one chunk for each rank. The main difference is, that all code gets executed by all ranks in the first place, so you must make everything that should be done only by the maste rank 0 conidtional.
if __name__ == '__main__':
comm = MPI.COMM_WORLD
if comm.rank == 0:
random_sample = [CDF_inv(x, E, M)
for x in [random.random() for e in range(n)]]
zeta_list = [get_coor_ellip() for e in range(n)]
x1, y1 = zip(*zeta_list)
zeta_list = np.column_stack((np.array(x1), np.array(y1)))
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
print len(x)*len(y)*n,'calculations to be carried out.'
M_list = np.array([.001 for i in range(n)])
# split zeta_list, M_list, X, and Y
zeta_list_split = np.array_split(zeta_list, comm.size, axis=0)
M_list_split = np.array_split(M_list, comm.size)
X_list = [X for e in range(comm.size)]
Y_list = [Y for e in range(comm.size)]
work_list = list(zip(zeta_list_split, M_list_split, X_list, Y_list))
else:
work_list = None
my_work = comm.scatter(work_list)
my_alpha = get_alpha(my_work)
alpha_list = comm.gather(my_alpha)
if comm.rank == 0:
alpha_x = 0
alpha_y = 0
for e in alpha_list:
alpha_x += e[0] * 4 * G / (c**2)
alpha_y += e[1] * 4 * G / (c**2)
This works fine as long as each processor gets a similar amount of work. If communication becomes an issue, you might want to split up the data generation among processors instead of doing it all on the master rank 0.
Note: Some things about the code are bogus, e.g. alpha_[xy] ends up as np.ndarray. The serial version runs into an error.

For people who are still interested in similar subjects, I highly recommend having a look at the MPIPoolExecutor() class here and the documentation is here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Implementation of gradient descent blowing up to infinity? - python

Related

Why I'm getting a huge cost in Stochastic Gradient Descent Implementation?

My vectorization implementation of gradient descent does not get me the right answer

Gradient Descent is not converging for very large values in a small dataset

How to add L1 normalization in python?

Replacing multiprocessing pool.map with mpi4py

Categories

Resources