Code Not Converging Vanilla Gradient Descent

I have an analytical cost function f(x,y) and its partial derivatives dx and dy. The code runs, but I can't tell whether my gradient descent is broken. Should I plot the partial derivatives with respect to x and y?
import math
import numpy as np
import matplotlib.pyplot as plt
gamma = 0.00001 # learning rate
iterations = 10000 #steps
theta = np.array([0,5]) #starting value
thetas = []
costs = []
# calculate cost of any point
def cost(theta):
    x = theta[0]
    y = theta[1]
    return 100*x*math.exp(-0.5*x*x+0.5*x-0.5*y*y-y+math.pi)
def gradient(theta):
    x = theta[0]
    y = theta[1]
    dx = 100*math.exp(-0.5*x*x+0.5*x-0.0035*y*y-y+math.pi)*(1+x*(-x + 0.5))
    dy = 100*x*math.exp(-0.5*x*x+0.5*x-0.05*y*y-y+math.pi)*(-y-1)
    gradients = np.array([dx,dy])
    return gradients
#for 2 features
for step in range(iterations):
    theta = theta - gamma*gradient(theta)
    value = cost(theta)
    thetas.append(theta)
    costs.append(value)
thetas = np.array(thetas)
X = thetas[:,0]
Y = thetas[:,1]
Z = np.array(costs)
iterations = [num for num in range(iterations)]
plt.plot(Z)
plt.xlabel("num. iteration")
plt.ylabel("cost")

I strongly recommend checking whether your analytic gradient is correct by first comparing it against a numerical gradient.
That is, make sure that f'(x) ≈ (f(x+h) - f(x)) / h for some small h.
After that, make sure your updates actually go in the right direction: pick a point where you know x or y should decrease, then check the sign of your gradient function's output.
Also make sure your goal is actually minimization rather than maximization.
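As a rough sketch of that check, the snippet below compares the gradient exactly as posted in the question against a finite-difference estimate of the posted cost. It uses central differences rather than the one-sided formula above, which is a swapped-in variation that is usually a little more accurate. The names gradient_as_posted and numerical_gradient, the step h and the test point (1, 1) are assumptions made for illustration.

```python
import math
import numpy as np

def cost(theta):
    x, y = theta
    return 100 * x * math.exp(-0.5*x*x + 0.5*x - 0.5*y*y - y + math.pi)

def gradient_as_posted(theta):
    # Copied from the question, including the 0.0035 and 0.05 coefficients
    # on y*y (the cost itself uses 0.5).
    x, y = theta
    dx = 100 * math.exp(-0.5*x*x + 0.5*x - 0.0035*y*y - y + math.pi) * (1 + x*(-x + 0.5))
    dy = 100 * x * math.exp(-0.5*x*x + 0.5*x - 0.05*y*y - y + math.pi) * (-y - 1)
    return np.array([dx, dy])

def numerical_gradient(f, theta, h=1e-6):
    # Central differences: (f(t + h*e_i) - f(t - h*e_i)) / (2h) per coordinate.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

point = np.array([1.0, 1.0])  # arbitrary test point, chosen for illustration
analytic = gradient_as_posted(point)
numeric = numerical_gradient(cost, point)
print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, rtol=1e-4))
```

At this test point the two disagree, which points at the y*y coefficients inside the exponentials of dx and dy: they differ from the 0.5 used in cost.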

Related

Implementing stochastic gradient descent

I am trying to implement a basic version of stochastic gradient descent for multiple linear regression, with the L2 norm as the loss function.
The result can be seen in this picture:
It's pretty far off the ideal regression line, but I don't really understand why that's the case. I double-checked all array dimensions and they all seem to fit.
Below is my source code. If anyone can see my error or give me a hint, I would appreciate it.
import numpy as np
import matplotlib.pyplot as plt
def SGD(x,y,learning_rate):
    theta = np.array([[0],[0]])
    for i in range(N):
        xi = x[i].reshape(1,-1)
        y_pre = xi@theta
        theta = theta + learning_rate*(y[i]-y_pre[0][0])*xi.T
    print(theta)
    return theta
N = 100
x = np.array(np.linspace(-2,2,N))
y = 4*x + 5 + np.random.uniform(-1,1,N)
X = np.array([x**0,x**1]).T
plt.scatter(x,y,s=6)
th = SGD(X,y,0.1)
y_reg = np.matmul(X,th)
print(y_reg)
print(x)
plt.plot(x,y_reg)
plt.show()
Edit: Another solution was to shuffle the measurements with x = np.random.permutation(x)

To illustrate my comment:
def SGD(x,y,n,learning_rate):
    theta = np.array([[0],[0]])
    # currently it does exactly one iteration. do more
    for _ in range(n):
        for i in range(len(x)):
            xi = x[i].reshape(1,-1)
            y_pre = xi@theta
            theta = theta + learning_rate*(y[i]-y_pre[0][0])*xi.T
    print(theta)
    return theta
SGD(X,y,10,0.01) yields the correct result.
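A minimal sketch that combines the repeated passes above with the shuffling idea from the question's edit. Instead of permuting x itself, it permutes the sample indices on every pass so that each input stays paired with its target. The function name sgd, the epochs and seed parameters, and the fixed random seeds are assumptions for illustration, not part of the original posts.

```python
import numpy as np

def sgd(X, y, epochs, learning_rate, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((X.shape[1], 1))
    for _ in range(epochs):
        # Visit the samples in a fresh random order on every pass; indexing
        # X and y with the same index keeps inputs and targets aligned.
        for i in rng.permutation(len(X)):
            xi = X[i].reshape(1, -1)
            error = y[i] - (xi @ theta)[0, 0]
            theta = theta + learning_rate * error * xi.T
    return theta

# Same kind of data as in the question: y = 4x + 5 plus uniform noise.
rng = np.random.default_rng(42)
N = 100
x = np.linspace(-2, 2, N)
y = 4 * x + 5 + rng.uniform(-1, 1, N)
X = np.column_stack([np.ones(N), x])

theta = sgd(X, y, epochs=10, learning_rate=0.01)
print(theta.ravel())  # intercept and slope estimates
```

With ten passes and a learning rate of 0.01, the estimates should land near the generating values of 5 and 4, up to the noise in the data.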

Weights explode in polynomial regression with gradient descent

I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
N = 10
D = 5
X = np.linspace(0,100, N)
Y = np.sin(0.1*X)*50
X = X.reshape(N, 1)
Xb = np.array([[1]*N]).T
for i in range(1, D):
    Xb = np.concatenate((Xb, X**i), axis=1)
#Randomly initialize the weights
w = np.random.randn(D)/np.sqrt(D)
#Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)),Xb.T.dot(Y))
#Yhat = Xb.dot(w)
#Gradient descent
learning_rate = 0.0001
for i in range(500):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    w = w - learning_rate*(Xb.T.dot(delta) + 100*w)
print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X,Yhat)
plt.show()
Thanks!

When updating theta, subtract from it the learning rate times the gradient of the cost with respect to theta, divided by the training set size. You also have to divide your penalty term by the training set size. But the main problem is that your learning rate is too large. For future debugging, it helps to print the cost at every step to see whether gradient descent is working and whether the learning rate is too large, too small, or just right.
Below is the code for a 2nd-degree polynomial, which found the optimum thetas (as you can see, the learning rate is really small). I've also added the cost function.
N = 2
D = 2
#Gradient descent
learning_rate = 0.000000000001
for i in range(200):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
    w = w - learning_rate*(np.dot(delta, Xb)) * (1/N)
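To see why the learning rate has to be so small, here is a hedged sketch that estimates an upper bound for it on the question's degree-4 design matrix. It assumes the unpenalized cost (1/(2N)) * ||Xb w - Y||^2 (the penalty term is left out, which is a simplification). For a quadratic cost like this, plain gradient descent with a fixed step can only converge when the step stays below 2 divided by the largest eigenvalue of the Hessian.

```python
import numpy as np

N, D = 10, 5
X = np.linspace(0, 100, N).reshape(N, 1)
# Same columns as Xb in the question: 1, x, x^2, x^3, x^4.
Xb = np.concatenate([X**i for i in range(D)], axis=1)

# Hessian of (1/(2N)) * ||Xb w - Y||^2 with respect to w (assumed cost).
hessian = Xb.T @ Xb / N
largest = np.linalg.eigvalsh(hessian).max()

print("largest eigenvalue:", largest)
print("learning rate must stay below:", 2 / largest)
```

Because x reaches 100, the x^4 column alone contributes entries on the order of 1e8, so the bound ends up many orders of magnitude below the 0.0001 used in the question. Rescaling the features before fitting is a common way to relax that bound.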

Cost value doesn't decrease when using gradient descent

I have data pairs (x,y) which are created by a cubic function
y = g(x) = ax^3 − bx^2 − cx + d
plus some random noise. Now, I want to fit a model (parameters a,b,c,d) to this data using gradient descent.
My implementation:
param={}
param["a"]=0.02
param["b"]=0.001
param["c"]=0.002
param["d"]=-0.04
def model(param,x,y,derivative=False):
    x2=np.power(x,2)
    x3=np.power(x,3)
    y_hat = param["a"]*x3+param["b"]*x2+param["c"]*x+param["d"]
    if derivative==False:
        return y_hat
    derv={} #of Cost function w.r.t parameters
    m = len(y_hat)
    derv["a"]=(2/m)*np.sum((y_hat-y)*x3)
    derv["b"]=(2/m)*np.sum((y_hat-y)*x2)
    derv["c"]=(2/m)*np.sum((y_hat-y)*x)
    derv["d"]=(2/m)*np.sum((y_hat-y))
    return derv
def cost(y_hat,y):
    assert(len(y)==len(y_hat))
    return (np.sum(np.power(y_hat-y,2)))/len(y)
def optimizer(param,x,y,lr=0.01,epochs = 100):
    for i in range(epochs):
        y_hat = model(param,x,y)
        derv = model(param,x,y,derivative=True)
        param["a"]=param["a"]-lr*derv["a"]
        param["b"]=param["b"]-lr*derv["b"]
        param["c"]=param["c"]-lr*derv["c"]
        param["d"]=param["d"]-lr*derv["d"]
        if i%10==0:
            #print (y,y_hat)
            #print(param,derv)
            print(cost(y_hat,y))
X = np.array(x)
Y = np.array(y)
optimizer(param,X,Y,0.01,100)
When run, the cost seems to be increasing:
36.140028646153525
181.88127675295928
2045.7925570171055
24964.787906199843
306448.81623701524
3763271.7837247783
46215271.5069297
567552820.2134454
6969909237.010273
85594914704.25394
Did I compute the gradients wrong? I don't know why the cost is exploding.
Here is the data: https://pastebin.com/raw/1VqKazUV.

If I run your code with e.g. lr=1e-4, the cost decreases.
Check your gradients (just print the result of model(..., True)); you will see that they are quite large. Since your learning rate is not small enough to compensate, you are likely oscillating away from the minimum. Any ML textbook has example plots of this, and you should also be able to see it if you print your parameters after every iteration.
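The effect can be reproduced without the original data. The sketch below is only an illustration on synthetic, noise-free data: the x range of [-3, 3] and the generating coefficients are made up, and only the starting parameters and the gradient formula are taken from the question. It records the cost and the gradient norm for the two learning rates discussed here.

```python
import numpy as np

def run(lr, epochs=100):
    # Synthetic stand-in for the pastebin data (an assumption, not the real data).
    x = np.linspace(-3, 3, 50)
    y = 0.5 * x**3 - 1.0 * x**2 + 2.0 * x + 1.0
    features = np.column_stack([x**3, x**2, x, np.ones_like(x)])
    w = np.array([0.02, 0.001, 0.002, -0.04])  # starting values from the question
    costs, grad_norms = [], []
    for _ in range(epochs):
        residual = features @ w - y
        # Same gradient as in the question: (2/m) * sum(residual * feature).
        grad = (2 / len(x)) * features.T @ residual
        costs.append(np.mean(residual**2))
        grad_norms.append(np.linalg.norm(grad))
        w = w - lr * grad
    return costs, grad_norms

for lr in (1e-2, 1e-4):
    costs, grad_norms = run(lr)
    print(f"lr={lr:g}: first gradient norm {grad_norms[0]:.1f}, "
          f"cost {costs[0]:.3f} -> {costs[-1]:.3f}")
```

On this synthetic data the cost grows with lr=1e-2 and shrinks with lr=1e-4, which mirrors the behaviour described above. The exact threshold depends on the scale of x, so the real data may behave differently in detail.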

Bad results from LMS stochastic gradient descent

I'm trying to adapt a batch gradient descent algorithm from a previous question to do stochastic gradient descent, but my cost seems to get stuck pretty far from the minimum value (in the example, around 1750 when the minimum is around 1450). Once it reaches that value, it just starts oscillating there. I also tried shuffling range(0, x.shape[0]-1) on every iteration of l, but it didn't make any difference. I expected oscillations around the optimal value, but this seems too far off, so I think there must be a mistake.
import numpy as np
y = np.asarray([[400], [330], [369], [232], [540]], dtype=float)
x = np.asarray([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]], dtype=float)
x = np.concatenate((np.ones((5,1)), x), axis=1)
theta = np.asarray([[0], [.5], [.5]], dtype=float)
fscale = np.sum(x, axis=0)
x /= fscale
alpha = .1
for l in range(1,100000):
    for i in range(0, x.shape[0]-1):
        h = np.dot(x, theta)
        gradient = ((h[i:i+1] - y[i:i+1]) * x[i:i+1]).T
        theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze() / fscale)

LMS batch gradient descent with NumPy

I'm trying to write some very simple LMS batch gradient descent but I believe I'm doing something wrong with the gradient. The ratio between the order of magnitude and the initial values for theta is very different for the elements of theta so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01
for i in range(1,1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze().tolist())

The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y, rcond=None)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.
