Linear regression using Gradient Descent - python

I'm facing some issues trying to find the linear regression line using gradient descent, and I'm getting weird results.
Here is the function:
def gradient_descent(m_k, c_k, learning_rate, points):
    n = len(points)
    dm, dc = 0, 0
    for i in range(n):
        x = points.iloc[i]['alcohol']
        y = points.iloc[i]['total']
        dm += -(2/n) * x * (y - (m_k * x + c_k))  # Partial derivative in m
        dc += -(2/n) * (y - (m_k * x + c_k))      # Partial derivative in c
    m = m_k - dm * learning_rate
    c = c_k - dc * learning_rate
    return m, c
And combined with a for loop
l_rate = 0.0001
m, c = 0, 0
epochs = 1000

for _ in range(epochs):
    m, c = gradient_descent(m, c, l_rate, dataset)

plt.scatter(dataset.alcohol, dataset.total)
plt.plot(list(range(2, 10)), [m * x + c for x in range(2, 10)], color='red')
plt.show()
Gives this result:
Slope: 2.8061974241244196
Y intercept: 0.5712221080810446
The problem, though, is that using sklearn to compute the slope and intercept, i.e.
model = LinearRegression(fit_intercept=True).fit(np.array(dataset['alcohol']).copy().reshape(-1, 1),
                                                 np.array(dataset['total']).copy())
I get something completely different:
Slope: 2.0325063
Intercept: 5.8577761548263005
Any idea why? Looking on SO I've found that a possible problem could be too high a learning rate, but as stated above I'm currently using 0.0001.

Sklearn's LinearRegression doesn't use gradient descent - it uses Ordinary Least Squares (OLS) regression, which is a non-iterative method.
For your model, you might consider randomly initialising m, c rather than starting with 0,0. You could also consider adjusting the learning rate or using an adaptive learning rate.
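Since OLS has a closed form, you can also sanity-check your gradient-descent output against it directly. Here is a minimal sketch, assuming dataset is the same DataFrame from the question with its 'alcohol' and 'total' columns:

import numpy as np

# Closed-form OLS via least squares: solves X w = y for w = [slope, intercept].
# Assumes `dataset` is the DataFrame used in the question.
X = np.column_stack([dataset['alcohol'].to_numpy(), np.ones(len(dataset))])
y = dataset['total'].to_numpy()
slope, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print('Slope:', slope)
print('Y intercept:', intercept)

If your gradient-descent estimates drift toward these values as you increase the number of epochs, the implementation is fine and it simply hasn't converged yet.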


Neural Network Backpropagation code not working

I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0 or 1.
I worked out the equations for backpropagation and found that, despite being so simple, my code does not converge on the XOR data.
Let W be the 3x3 matrix of weights connecting the input and hidden layer, and w be the 1x3 matrix that connects the hidden to the output layer. Here are some helper functions for my method:
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
This just takes in a value and makes a prediction using the formula sig(w * sig(W * x)). We also have:
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the squared error summed over the given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while i < epochs:
        prevobj = obj
        dellw = np.zeros((1, 3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s, u, v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
            dellW = np.array([dells, dellu, dellv])
            W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and derivatives I took, does not converge to anything. In fact the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe the problem lies with the W matrix gradient, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed with np.random.seed(0) just so that I could be consistent with the matrices I'm dealing with.
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you then iterate with Python loops. Likewise, you should know that np.matmul() is the @ operator (and np.dot() behaves the same for 2-D arrays). It is unclear how you are using the derivative: you have it explicitly stated at the start for the activation function, and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, derived all at once from your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) solves for a fixed set of weights given a random variable, in stochastic gradient descent we do the opposite: we fix the random variable to our data and optimize our weights. Hence, your functions should only have your weights as "free variables"; everything else is fixed. It is important to follow what is fixed and what is free to update, and your code does not reflect that distinction.
SGD algorithm outline:

1. Initialize random params.
2. Update the params by moving them a small step in the direction of steepest descent.
3. Repeat step (2) for a specified number of iterations.
4. Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data):
import numpy as np

# Data
X = np.random.random((100,))                    # Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,))  # Linear model + noise

# Functions (the only free variable is the params) (we want the F of best fit under MSE)
F = lambda p: p[0]*X + p[1]
dF = lambda p: np.array([X, np.ones(X.shape)])
MSE = lambda p: (1/Y.shape[0])*((Y - F(p))**2).sum(0)
dMSE = lambda p: (1/Y.shape[0])*(-2*(Y - F(p))*dF(p)).sum(1)

# SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0, 0.0])
for i in range(epochs):
    params -= lr*dMSE(params)
print(params)
Hopefully, written this way, it is super clear exactly where the subtraction of the gradient occurs and exactly how it is calculated. Note also, in case it wasn't clear, that the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module, so SGD is clearly a useless way to optimize two variables:
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out: in my code I was not summing the hidden node weight derivatives, and instead was assigning at every loop iteration. The correct version would be as follows (with the accumulators initialized to zero before the loop, and the weight update applied once after it):
dells = np.zeros(len(traindata[0]))  # accumulators must start at zero
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s, u, v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
W -= eta * np.array([dells, dellu, dellv])  # assemble dellW and update once per epoch

Gradient descent derivation for orthogonal linear regression from scratch in Numpy Python

I wanted to derive gradient descent from scratch using the error function for orthogonal linear regression, so that I could write Python code in NumPy.
However, I am unsure about the error function in this.
y = W0 + W1 * X
Can anyone help me derive this!!
Thanks in advance.
Let's say you have y = m*x + p with (m; p) being the parameters you are tuning.
When you feed an observation sample X' through your model, you obtain a vector of predictions Y'. Then you can compute your loss, say a mean squared error, as e = 1/n * sum_i (Yi - Yi')², where Y is the y-coordinates associated with the x-coordinates in X'.
Now, we want to differentiate e relative to your parameters:
de / dm = -2 / n * sum_i ( Xi' * ( Yi - p - m * Xi' ) )
de / dp = -2 / n * sum_i ( Yi - p - m * Xi' )
Then you want to minimize e thanks to this gradient:
m <- m - lr * de / dm
p <- p - lr * de / dp
With lr being your learning rate.
Do the math on your own though, just in case I made an error.
EDIT: I've implemented it quickly using those formulas, and the math seems OK (at least it converges to the solution quickly).
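For reference, here is a minimal NumPy sketch of those update rules (the synthetic data is made up purely for illustration):

import numpy as np

# Made-up data: y = 2*x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.random(100)
Y = 2*X + 1 + 0.05*rng.standard_normal(100)

m, p = 0.0, 0.0
lr = 0.1
n = len(X)
for _ in range(5000):
    r = Y - p - m*X                   # residuals Yi - p - m*Xi'
    m -= lr * (-2/n) * np.sum(X * r)  # de/dm
    p -= lr * (-2/n) * np.sum(r)      # de/dp
print(m, p)  # should land near (2, 1)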

Weights explode in polynomial regression with gradient descent

I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi

N = 10
D = 5
X = np.linspace(0, 100, N)
Y = np.sin(0.1*X)*50
X = X.reshape(N, 1)
Xb = np.array([[1]*N]).T
for i in range(1, D):
    Xb = np.concatenate((Xb, X**i), axis=1)

# Randomly initialize the weights
w = np.random.randn(D)/np.sqrt(D)

# Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)), Xb.T.dot(Y))
#Yhat = Xb.dot(w)

# Gradient descent
learning_rate = 0.0001
for i in range(500):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    w = w - learning_rate*(Xb.T.dot(delta) + 100*w)

print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X, Yhat)
plt.show()
Thanks!
When updating theta, you have to subtract from it the learning rate times the gradient of the cost with respect to theta, divided by the training set size. You also have to divide your penalty term by the training set size. But the main problem is that your learning rate is too large. For future debugging, it is helpful to print the cost to see whether gradient descent is working and whether the learning rate is too small or just right.
Below is the code for a 2nd-degree polynomial, which found the optimum thetas (as you can see, the learning rate is really small). I've also added the cost function.
N = 2
D = 2

# Gradient descent
learning_rate = 0.000000000001
for i in range(200):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
    w = w - learning_rate*(np.dot(delta, Xb)) * (1/N)
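Note that the snippet above drops the penalty term from the question. If you want to keep the regularization, the update described above would look something like this (a sketch, with penalty standing in for the question's factor of 100):

penalty = 100  # regularization strength from the question
for i in range(200):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    print((1/N) * np.sum(delta**2))  # cost (MSE), useful for debugging
    w = w - learning_rate * (np.dot(delta, Xb) + penalty*w) * (1/N)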

LMS batch gradient descent with NumPy

I'm trying to write some very simple LMS batch gradient descent but I believe I'm doing something wrong with the gradient. The ratio between the order of magnitude and the initial values for theta is very different for the elements of theta so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01

for i in range(1, 1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze().tolist())
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print(((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01   6.38435468e-02   1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y, rcond=None)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.

Multi variable gradient descent

I am learning gradient descent for calculating coefficients. Below is what I am doing:
#!/usr/bin/python
import numpy as np

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost per example (the 2 in 2*m doesn't really matter here,
        # but to be consistent with the gradient, I include it)
        cost = np.sum(loss ** 2) / (2 * m)
        #print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update
        theta = theta - alpha * gradient
    return theta

X = np.array([41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8])
y = np.array([251.3,251.3,248.3,267.5,273.0,276.5,270.3,274.9,285.0,290.0,297.0,302.5,304.5,309.3,321.7,330.7,349.0])

n = np.max(X.shape)
x = np.vstack([np.ones(n), X]).T
m, n = np.shape(x)
numIterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)
Now my above code works fine. If I now try multiple variables and replace X with X1 like the following:
X1 = np.array([[41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8], [29.1,29.3,29.5,29.7,29.9,30.3,30.5,30.7,30.8,30.9,31.5,31.7,31.9,32.0,32.1,32.5,32.9]])
then my code fails and shows me the following error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
[ nan nan nan]
Can anybody tell me how I can do gradient descent using X1? My expected output using X1 is:
[-153.5 1.24 12.08]
I am open to other Python implementations also. I just want the coefficients (also called thetas) for X1 and y.
The problem is that your algorithm is not converging; it diverges instead. The first error:
JustTestingSGD.py:14: RuntimeWarning: overflow encountered in square
cost = np.sum(loss ** 2) / (2 * m)
comes from the fact that at some point the square can no longer be computed, as 64-bit floats cannot hold numbers larger than about 1.8 * 10^308.
JustTestingSGD.py:19: RuntimeWarning: invalid value encountered in subtract
theta = theta - alpha * gradient
This is only a consequence of the first error: the numbers are no longer reasonable for calculations.
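You can reproduce both warnings in isolation with a tiny example:

import numpy as np

big = np.array([1e200])
print(big ** 2)                     # [inf] -> RuntimeWarning: overflow encountered in square
print(np.array([np.inf]) - np.inf)  # [nan] -> RuntimeWarning: invalid value encountered in subtract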
You can actually see the divergence by uncommenting your debug print line. The cost starts to grow, as there is no convergence.
If you try your function with X1 and a smaller value for alpha, it converges.
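For example, something along these lines (a sketch reusing gradientDescent and y from above; the alpha here is only a guess and may need tuning, and convergence without feature scaling will be very slow):

X1 = np.array([[41.9,43.4,43.9,44.5,47.3,47.5,47.9,50.2,52.8,53.2,56.7,57.0,63.5,65.3,71.1,77.0,77.8],
               [29.1,29.3,29.5,29.7,29.9,30.3,30.5,30.7,30.8,30.9,31.5,31.7,31.9,32.0,32.1,32.5,32.9]])

n = X1.shape[1]
x = np.vstack([np.ones(n), X1]).T  # 17x3 design matrix: bias column plus two features
m, n = np.shape(x)
theta = np.ones(n)

# alpha reduced from 0.0005; the exact value is a guess and may need tuning
theta = gradientDescent(x, y, theta, 0.00001, m, 1000000)
print(theta)

Alternatively, scaling the features first lets you use a larger alpha and converge much faster.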
