I'm new in machine learning. I started from linear regression with gradient descent. I have python code for this and I understad this way. My question is: Gradient descent algorithm minimize function, can I plot this function? I want to see what the function in which the minimum is looked like. It possible?
My code:
import matplotlib.pyplot as plt import numpy as np
def sigmoid_activation(x):
return 1.0 / (1 + np.exp(-x))
X = np.array([
[2.13, 5.49],
[8.35, 6.74],
[8.17, 5.79],
[0.62, 8.54],
[2.74, 6.92] ])
y = [0, 1, 1, 0, 0]
xdata = [row[0] for row in X] ydata = [row[1] for row in X]
X = np.c_[np.ones((X.shape[0])), X] W = np.random.uniform(size=(X.shape[1], ))
lossHistory = []
for epoch in np.arange(0, 5):
preds = sigmoid_activation(X.dot(W))
error = preds - y
loss = np.sum(error ** 2)
lossHistory.append(loss)
gradient = X.T.dot(error) / X.shape[0]
W += - 0.44 * gradient
plt.scatter(xdata, ydata) plt.show()
plt.plot(np.arange(0, 5), lossHistory) plt.show()
for i in np.random.choice(5, 5):
activation = sigmoid_activation(X[i].dot(W))
label = 0 if activation < 0.5 else 1
print("activation={:.4f}; predicted_label={}, true_label={}".format(
activation, label, y[i]))
Y = (-W[0] - (W[1] * X)) / W[2]
plt.scatter(X[:, 1], X[:, 2], c=y) plt.plot(X, Y, "r-") plt.show()
With the risk of being obvious... You can simply plot lossHistory with matplotlib. Or am I missing something?
EDIT: apparently the OP asked what the Gradient Descent (GD) is minimizing. I will try to answer here and I hope I can answer the original question.
The GD algorithm is a generic algorithm to find the minimum of a function in parameter space. In your case (and that is how is usually used with Neural Networks) you want to find the minimum of a loss function: the MSE (Mean Squared Error). You implement the GD algorithm updating the weights as you did with
gradient = X.T.dot(error) / X.shape[0]
W += - 0.44 * gradient
The gradient is just the partial derivative of your loss function (the MSE) with respect to the weights. So are effectively minimizing the loss function (MSE). Then you update your weights with a learning rate of 0.44.
Then you simply save the value of your loss function in the array
loss = np.sum(error ** 2)
lossHistory.append(loss)
and therefore the lossHistory array contains your cost (or loss) function that you can plot to check your learning process. The plot should show something decreasing. Does this explanation help you?
Best,
Umberto
Related
I'm facing some issues trying to find the linear regression line using Gradient Descent, getting to weird results.
Here is the function:
def gradient_descent(m_k, c_k, learning_rate, points):
n = len(points)
dm, dc = 0, 0
for i in range(n):
x = points.iloc[i]['alcohol']
y = points.iloc[i]['total']
dm += -(2/n) * x * (y - (m_k * x + c_k)) # Partial der in m
dc += -(2/n) * (y - (m_k * x + c_k)) # Partial der in c
m = m_k - dm * learning_rate
c = c_k - dc * learning_rate
return m, c
And combined with a for loop
l_rate = 0.0001
m, c = 0, 0
epochs = 1000
for _ in range(epochs):
m, c = gradient_descent(m, c, l_rate, dataset)
plt.scatter(dataset.alcohol, dataset.total)
plt.plot(list(range(2, 10)), [m * x + c for x in range(2,10)], color='red')
plt.show()
Gives this result:
Slope: 2.8061974241244196
Y intercept: 0.5712221080810446
The problem is though that taking advantage of sklearn to compute the slope and intercept, i.e.
model = LinearRegression(fit_intercept=True).fit(np.array(dataset['alcohol']).copy().reshape(-1, 1),
np.array(dataset['total']).copy())
I get something completely different:
Slope: 2.0325063
Intercept: 5.8577761548263005
Any idea why? Looking on SO I've found out that a possible problem could be a too high learning rate, but as stated above I'm currently using 0.0001
Sklearn's LinearRegression doesn't use gradient descent - it uses Ordinary Least Squares (OLS) Regression which is a non-iterative method.
For your model, you might consider randomly initialising m, c rather than starting with 0,0. You could also consider adjusting the learning rate or using an adaptive learning rate.
I am trying to plot gradient descent cost_list with respect to epoch, but when I am trying to do so, I am getting lost with basic python function structure. I am appending my code structure what I am trying to do.
def gradientDescent(x, y, theta, alpha, m, numIterations):
xTrans = x.T
cost_list=[]
for i in range(0, numIterations):
hypothesis = np.dot(x, theta)
loss = hypothesis - y
cost = np.sum(loss ** 2) / (2 * m)
cost_list.append(cost)
print("Iteration %d | Cost: %f" % (i, cost))
# avg gradient per example
gradient = np.dot(xTrans, loss) / m
# update
theta = theta - alpha * gradient
#a = plt.plot(i,theta)
return theta,cost_list
what I am trying to do is I am return the "cost_list" at each step and creating a list of cost and I am trying to plot now with the below Line of codes.
theta,cost_list=gradientDescent(x,y,bias,0.000001,len(my dataframe),100)
plt.plot(list(range(numIterations)), cost_list, '-r')
but it's giving me error with numIterations not defined.
what should be the possible edit to the code
I tried your code with sample data;
df = pd.DataFrame(np.random.randint(1,50, size=(50,2)), columns=list('AB'))
x=df.A
y=df.B
bias = np.random.randn(50,1)
numIterations = 100
theta,cost_list=gradientDescent(x,y,bias,0.000001,len(df),numIterations)
plt.plot(list(range(numIterations)), cost_list, '-r')
I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
N = 10
D = 5
X = np.linspace(0,100, N)
Y = np.sin(0.1*X)*50
X = X.reshape(N, 1)
Xb = np.array([[1]*N]).T
for i in range(1, D):
Xb = np.concatenate((Xb, X**i), axis=1)
#Randomly initializie the weights
w = np.random.randn(D)/np.sqrt(D)
#Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)),Xb.T.dot(Y))
#Yhat = Xb.dot(w)
#Gradient descent
learning_rate = 0.0001
for i in range(500):
Yhat = Xb.dot(w)
delta = Yhat - Y
w = w - learning_rate*(Xb.T.dot(delta) + 100*w)
print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X,Yhat)
plt.show()
Thanks!
When updating theta, you have to take theta and subtract it with the learning weight times the derivative of theta divided by the training set size. You also have to divide your penality term by the training size set. But the main problem is that your learning rate is too large. For future debugging, it is helpful to print the cost to see if gradient descent is working and if the learning rate is too small or just right.
Below here is the code for 2nd degree polynomial which the found the optimum thetas (as you can see the learning rate is really small). I've also added the cost function.
N = 2
D = 2
#Gradient descent
learning_rate = 0.000000000001
for i in range(200):
Yhat = Xb.dot(w)
delta = Yhat - Y
print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
w = w - learning_rate*(np.dot(delta, Xb)) * (1/N)
I'm trying to adapt a batch gradient descent algorithm from a previous question to do stochastic gradient descent, my cost seems to get stuck pretty far from the minimum value (in the example, around 1750 when the minimum is around 1450). It would seem like once it reaches that value, it just starts oscillating there. I also tried to shuffle range(0, x.shape[0]-1) every l but it didn't make any difference. I expect oscillations around the optimal value, but this just seemed too far off, so I think there must be a mistake.
import numpy as np
y = np.asfarray([[400], [330], [369], [232], [540]])
x = np.asfarray([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1)), x), axis=1)
theta = np.asfarray([[0], [.5], [.5]])
fscale = np.sum(x, axis=0)
x /= fscale
alpha = .1
for l in range(1,100000):
for i in range(0, x.shape[0]-1):
h = np.dot(x, theta)
gradient = ((h[i:i+1] - y[i:i+1]) * x[i:i+1]).T
theta -= alpha * gradient
print ((h - y)**2).sum(), theta.squeeze() / fscale
I'm trying to write some very simple LMS batch gradient descent but I believe I'm doing something wrong with the gradient. The ratio between the order of magnitude and the initial values for theta is very different for the elements of theta so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=np.int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01
for i in range(1,1000):
h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
theta -= alpha * gradient
print ((h - y)**2).sum(), theta.squeeze().tolist()
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=np.int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.