I am implementing gradient descent for regression using newtons method as explained in the 8.3 section of the Machine Learning A Probabilistic Perspective (Murphy) book. I am working with two dimensional data in this implementation. I am using following notations.
x = input data points m*2
y = labelled outputs(m) corresponding to input data
H = Hessian matrix is defined as
gradient descent update
where loss function is defined as
In my case
is array and H is
Here is my python implementation. However this is not working as cost is increasing in each iteration.
def loss(x,y,theta):
m,n = np.shape(x)
cost_list = []
for i in xrange(0,n):
x_0 = x[:,i].reshape((m,1))
predicted = np.dot(x_0, theta[i])
error = predicted - y
cost = np.sum(error ** 2) / m
cost_list.append(cost)
cost_list = np.array(cost_list).reshape((2,1))
return cost_list
def NewtonMethod(x,y,theta,maxIterations):
m,n = np.shape(x)
xTrans = x.transpose()
H = 2 * np.dot(xTrans,x) / m
Hinv = np.linalg.inv(H)
thetaPrev = np.zeros_like(theta)
best_iter = maxIterations
for i in range(0,maxIterations):
cost = loss(x,y,theta)
theta = theta - np.dot(Hinv,cost))
if(np.allclose(theta,thetaPrev,rtol=0.001,atol=0.001)):
break;
else:
thetaPrev = theta
best_iter = i
return theta
Here are the sample values I used
import numpy as np
x = np.array([[-1.7, -1.5],[-1.0 , -0.3],[ 1.7 , 1.5],[-1.2, -0.7 ][ 0.6, 0.1]])
y = np.array([ 0.3 , 0.07, -0.2, 0.07, 0.03 ])
theta = np.zeros(2)
NewtonMethod(x,y,theta,100)
Need help / suggestions to fix this problem.
Thanks
You are effectively using a step size of 1. Try reducing the step size and see if that helps. That is, Instead of
Do this:
with a smaller value than 1.
Related
I'm facing some issues trying to find the linear regression line using Gradient Descent, getting to weird results.
Here is the function:
def gradient_descent(m_k, c_k, learning_rate, points):
n = len(points)
dm, dc = 0, 0
for i in range(n):
x = points.iloc[i]['alcohol']
y = points.iloc[i]['total']
dm += -(2/n) * x * (y - (m_k * x + c_k)) # Partial der in m
dc += -(2/n) * (y - (m_k * x + c_k)) # Partial der in c
m = m_k - dm * learning_rate
c = c_k - dc * learning_rate
return m, c
And combined with a for loop
l_rate = 0.0001
m, c = 0, 0
epochs = 1000
for _ in range(epochs):
m, c = gradient_descent(m, c, l_rate, dataset)
plt.scatter(dataset.alcohol, dataset.total)
plt.plot(list(range(2, 10)), [m * x + c for x in range(2,10)], color='red')
plt.show()
Gives this result:
Slope: 2.8061974241244196
Y intercept: 0.5712221080810446
The problem is though that taking advantage of sklearn to compute the slope and intercept, i.e.
model = LinearRegression(fit_intercept=True).fit(np.array(dataset['alcohol']).copy().reshape(-1, 1),
np.array(dataset['total']).copy())
I get something completely different:
Slope: 2.0325063
Intercept: 5.8577761548263005
Any idea why? Looking on SO I've found out that a possible problem could be a too high learning rate, but as stated above I'm currently using 0.0001
Sklearn's LinearRegression doesn't use gradient descent - it uses Ordinary Least Squares (OLS) Regression which is a non-iterative method.
For your model, you might consider randomly initialising m, c rather than starting with 0,0. You could also consider adjusting the learning rate or using an adaptive learning rate.
I am trying to use PYMC3 for a Bayesian model where I would like to repeatedly train my model on new unseen data. I am thinking I would need to update the priors with the posterior of the previously trained model every time I see the data, similar to how is achieved here https://docs.pymc.io/notebooks/updating_priors.html. They use the following function that finds the KDE from the samples and replacing each of the original definitions of the parameters in the model with a call to from_posterior.
def from_posterior(param, samples):
smin, smax = np.min(samples), np.max(samples)
width = smax - smin
x = np.linspace(smin, smax, 100)
y = stats.gaussian_kde(samples)(x)
# what was never sampled should have a small probability but not 0,
# so we'll extend the domain and use linear approximation of density on it
x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
y = np.concatenate([[0], y, [0]])
return Interpolated(param, x, y)
And here is my original model.
def create_model(batsmen, bowlers, id1, id2, X):
testval = [[-5,0,1,2,3.5,5] for i in range(0, 9)]
l = [i for i in range(9)]
model = pm.Model()
with model:
delta_1 = pm.Uniform("delta_1", lower=0, upper=1)
delta_2 = pm.Uniform("delta_2", lower=0, upper=1)
inv_sigma_sqr = pm.Gamma("sigma^-2", alpha=1.0, beta=1.0)
inv_tau_sqr = pm.Gamma("tau^-2", alpha=1.0, beta=1.0)
mu_1 = pm.Normal("mu_1", mu=0, sigma=1/pm.math.sqrt(inv_tau_sqr), shape=len(batsmen))
mu_2 = pm.Normal("mu_2", mu=0, sigma=1/pm.math.sqrt(inv_tau_sqr), shape=len(bowlers))
delta = pm.math.ge(l, 3) * delta_1 + pm.math.ge(l, 6) * delta_2
eta = [pm.Deterministic("eta_" + str(i), delta[i] + mu_1[id1[i]] - mu_2[id2[i]]) for i in range(9)]
cutpoints = pm.Normal("cutpoints", mu=0, sigma=1/pm.math.sqrt(inv_sigma_sqr), transform=pm.distributions.transforms.ordered, shape=(9,6), testval=testval)
X_ = [pm.OrderedLogistic("X_" + str(i), cutpoints=cutpoints[i], eta=eta[i], observed=X[i]-1) for i in range(9)]
return model
Here, the problem is that some of my parameters such as mu_1, are multidimensional. This is why I get the following error:
ValueError: points have dimension 1, dataset has dimension 1500
because of the line y = stats.gaussian_kde(samples)(x).
Can someone please help me make this work for multi-dimensional parameters? I don't properly understand what KDE is and how the code computes it.
Thank you in advance!!
I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
N = 10
D = 5
X = np.linspace(0,100, N)
Y = np.sin(0.1*X)*50
X = X.reshape(N, 1)
Xb = np.array([[1]*N]).T
for i in range(1, D):
Xb = np.concatenate((Xb, X**i), axis=1)
#Randomly initializie the weights
w = np.random.randn(D)/np.sqrt(D)
#Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)),Xb.T.dot(Y))
#Yhat = Xb.dot(w)
#Gradient descent
learning_rate = 0.0001
for i in range(500):
Yhat = Xb.dot(w)
delta = Yhat - Y
w = w - learning_rate*(Xb.T.dot(delta) + 100*w)
print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X,Yhat)
plt.show()
Thanks!
When updating theta, you have to take theta and subtract it with the learning weight times the derivative of theta divided by the training set size. You also have to divide your penality term by the training size set. But the main problem is that your learning rate is too large. For future debugging, it is helpful to print the cost to see if gradient descent is working and if the learning rate is too small or just right.
Below here is the code for 2nd degree polynomial which the found the optimum thetas (as you can see the learning rate is really small). I've also added the cost function.
N = 2
D = 2
#Gradient descent
learning_rate = 0.000000000001
for i in range(200):
Yhat = Xb.dot(w)
delta = Yhat - Y
print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
w = w - learning_rate*(np.dot(delta, Xb)) * (1/N)
I created a function to calculate the parameters of a logarithm-function.
My aim is to predict the future results of data points that follow a logarithm function. But what is the most important is that my algorithm fits the last results better than the whole data points as it is the prediction that matters. I currently use Mean Squared Error to optimize my parameters but I do not know how to weight it such as it takes my most recent data points as more important than the first ones.
Here is my equation:
y = C * log( a * x + b )
Here is my code:
import numpy as np
from sklearn.metrics import mean_squared_error
def approximate_log_function(x, y):
C = np.arange(0.01, 1, step = 0.01)
a = np.arange(0.01, 1, step = 0.01)
b = np.arange(0.01, 1, step = 0.01)
min_mse = 9999999999
parameters = [0, 0, 0]
for i in np.array(np.meshgrid(C, a, b)).T.reshape(-1, 3):
y_estimation = i[0] * np.log(i[1] * np.array(x) + i[2])
mse = mean_squared_error(y, y_estimation)
if mse < min_mse:
min_mse = mse
parameters = [i[0], i[1], i[2]]
return (min_mse, parameters)
You can see in the image below the orange curve is the data I have and the blue line is my fitted line. We see that the line stretch a bit away from the line on the end and I would like to avoid that to improve the prediction from my function.
logarithm function graph
My question is twofold:
Is this actually the best way to do it or is it best to use another function (such as the increasing form of an Exponential Decay)? (y = C ( 1 - e-kt ), k > 0)
How can I change my code so that the last values are more important to be fitted than the first ones.
Usually, in non-linear least-squares, the inverse of the y values is taken as weight, that essentially eliminates outliers, you can expand on that idea by adding a function to calculate the weight based on the x position.
def xWeightA(x):
container=[]
for k in range(len(x)):
if k<int(0.9*len(x)):
container.append(1)
else:
container.append(1.2)
return container
def approximate_log_function(x, y):
C = np.arange(0.01, 1, step = 0.01)
a = np.arange(0.01, 1, step = 0.01)
b = np.arange(0.01, 1, step = 0.01)
min_mse = 9999999999
parameters = [0, 0, 0]
LocalWeight=xWeightA(x)
for i in np.array(np.meshgrid(C, a, b)).T.reshape(-1, 3):
y_estimation = LocalWeight*i[0] * np.log(i[1] * np.array(x) + i[2])
mse = mean_squared_error(y, y_estimation)
if mse < min_mse:
min_mse = mse
parameters = [i[0], i[1], i[2]]
return (min_mse, parameters)
Also, it looks like you're evaluating through the complete objective function, that makes the code to take to much time to find the minimum (at least on my machine). You can use curve_fit or polyfit as suggested, but if the goal is to generate the optimizer try adding an early break or a random search through the grid. Hope it helps
I'm trying to write some very simple LMS batch gradient descent but I believe I'm doing something wrong with the gradient. The ratio between the order of magnitude and the initial values for theta is very different for the elements of theta so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=np.int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01
for i in range(1,1000):
h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
theta -= alpha * gradient
print ((h - y)**2).sum(), theta.squeeze().tolist()
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=np.int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
h = np.dot(x, theta)
gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.