I'm trying to write some very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The orders of magnitude of the gradient components differ so much relative to the initial values of theta that either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01
for i in range(1,1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze().tolist())
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let's just scale the features by their L^1 norms, because it's simple:
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y, rcond=None)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.
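For what it's worth, another common scaling choice is standardization (subtract each feature's mean, divide by its standard deviation). A minimal sketch on the same x_orig, leaving the bias column untouched:
# Standardize the non-bias columns of x_orig: zero mean, unit variance.
mu = x_orig[:, 1:].mean(axis=0)
sigma = x_orig[:, 1:].std(axis=0)
x_std = x_orig.astype(float)
x_std[:, 1:] = (x_std[:, 1:] - mu) / sigma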
Related
I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi
N = 10
D = 5
X = np.linspace(0,100, N)
Y = np.sin(0.1*X)*50
X = X.reshape(N, 1)
Xb = np.array([[1]*N]).T
for i in range(1, D):
    Xb = np.concatenate((Xb, X**i), axis=1)
# Randomly initialize the weights
w = np.random.randn(D)/np.sqrt(D)
#Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)),Xb.T.dot(Y))
#Yhat = Xb.dot(w)
#Gradient descent
learning_rate = 0.0001
for i in range(500):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    w = w - learning_rate*(Xb.T.dot(delta) + 100*w)
print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X,Yhat)
plt.show()
Thanks!
When updating theta, you have to subtract the learning rate times the gradient of the cost with respect to theta, divided by the training set size. You also have to divide your penalty term by the training set size. But the main problem is that your learning rate is too large. For future debugging, it helps to print the cost at each step, so you can see whether gradient descent is working and whether the learning rate is too small or just right.
Below is the code for a 2nd-degree polynomial, which found the optimal thetas (as you can see, the learning rate is really small). I've also added the cost function.
N = 2
D = 2
#Gradient descent
learning_rate = 0.000000000001
for i in range(200):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
    w = w - learning_rate*(np.dot(delta, Xb)) * (1/N)
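If you also want to keep the L2 penalty from your original code, a sketch of the scaled update would look something like this (lam stands in for the penalty strength, 100 in your code):
lam = 100  # penalty strength from the original code
# scale both the data term and the penalty by the training set size
w = w - learning_rate * (np.dot(delta, Xb) + lam * w) * (1/N)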
When using LabelPropagation, I often run into this warning (imho it should be an error because it completely fails the propagation):
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
self.label_distributions_ /= normalizer
So after a few tries with the RBF kernel, I discovered the parameter gamma has an influence.
EDIT:
The problem comes from these lines:
if self._variant == 'propagation':
    normalizer = np.sum(
        self.label_distributions_, axis=1)[:, np.newaxis]
    self.label_distributions_ /= normalizer
I don't get how label_distributions_ can be all zeros, especially when its definition is:
self.label_distributions_ = safe_sparse_dot(
    graph_matrix, self.label_distributions_)
Gamma has an influence on the graph_matrix (because graph_matrix is the result of _build_graph(), which calls the kernel function). OK. But still, something's wrong.
OLD POST (before edit)
I remind you how graph weights are computed for the propagation: W = exp(-gamma * D), where D is the pairwise distance matrix between all points of the dataset.
The problem is: np.exp(x) returns 0.0 when x is a very large negative number (the result underflows to zero).
Let's imagine we have two points i and j such that dist(i, j) = 10.
>>> np.exp(np.asarray(-10*40, dtype=float)) # gamma = 40 => OKAY
1.9151695967140057e-174
>>> np.exp(np.asarray(-10*120, dtype=float)) # gamma = 120 => NOT OKAY
0.0
In practice, I'm not setting gamma manually but I'm using the method described in this paper (section 2.4).
So, how would one avoid this division by zero to get a proper propagation?
The only way I can think of is to normalize the dataset in every dimension, but then we lose some geometric/topological properties of the dataset (a 2x10 rectangle becoming a 1x1 square, for example).
Reproducible example:
In this example, it's worse: even with gamma = 20 it fails.
In [11]: from sklearn.semi_supervised.label_propagation import LabelPropagation
In [12]: import numpy as np
In [13]: X = np.array([[0, 0], [0, 10]])
In [14]: Y = [0, -1]
In [15]: LabelPropagation(kernel='rbf', tol=0.01, gamma=20).fit(X, Y)
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
self.label_distributions_ /= normalizer
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:290: ConvergenceWarning: max_iter=1000 was reached without convergence.
category=ConvergenceWarning
Out[15]:
LabelPropagation(alpha=None, gamma=20, kernel='rbf', max_iter=1000, n_jobs=1,
n_neighbors=7, tol=0.01)
In [16]: LabelPropagation(kernel='rbf', tol=0.01, gamma=2).fit(X, Y)
Out[16]:
LabelPropagation(alpha=None, gamma=2, kernel='rbf', max_iter=1000, n_jobs=1,
n_neighbors=7, tol=0.01)
Basically you're doing a softmax function, right?
The general way to prevent softmax from over/underflowing is (from here)
# Instead of this ...
def softmax(x, axis = 0):
    return np.exp(x) / np.sum(np.exp(x), axis = axis, keepdims = True)

# Do this
def softmax(x, axis = 0):
    e_x = np.exp(x - np.max(x, axis = axis, keepdims = True))
    return e_x / e_x.sum(axis, keepdims = True)
This bounds e_x between 0 and 1, and ensures one value of e_x will always be 1 (namely the element at np.argmax(x)). This prevents overflow and underflow (when np.exp(x.max()) is either bigger or smaller than float64 can handle).
In this case, as you can't change the algorithm, I would take the input D and make D_ = D - D.min(), as this should be numerically equivalent to the above: the largest exponent -gamma * D.min() then becomes 0 (you're just flipping the sign). Then do your algorithm with regard to D_.
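To see the effect, here is a minimal sketch using the gamma = 120 case from above (with hypothetical distances between distinct points):
import numpy as np
gamma = 120
d = np.array([10., 12., 15.])   # hypothetical pairwise distances
print(np.exp(-gamma * d))       # all three underflow to 0.0
d_ = d - d.min()                # shift so the largest exponent becomes 0
print(np.exp(-gamma * d_))      # largest weight is now exactly 1.0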
EDIT:
As recommended by @PaulBrodersen below, you can build a "safe" rbf kernel based on the sklearn implementation here:
import numpy as np
import sklearn.metrics

def rbf_kernel_safe(X, Y=None, gamma=None):
    X, Y = sklearn.metrics.pairwise.check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    K = sklearn.metrics.pairwise.euclidean_distances(X, Y, squared=True)
    K *= -gamma
    K -= K.max()
    np.exp(K, K)  # exponentiate K in-place
    return K
And then use it in your propagation
LabelPropagation(kernel = rbf_kernel_safe, tol = 0.01, gamma = 20).fit(X, Y)
Unfortunately I only have v0.18, which doesn't accept user-defined kernel functions for LabelPropagation, so I can't test it.
EDIT2:
Checking your source for why you have such large gamma values makes me wonder if you are using gamma = D.min()/3, which would be incorrect. The paper defines sigma = D.min()/3, while sigma enters w as
w = exp(-d**2/sigma**2) # Equation (1)
which would make the correct gamma value 1/sigma**2, i.e. 9/D.min()**2.
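In code, that correction would look something like this (assuming D.min() is the smallest distance between distinct points, as in the paper):
sigma = D.min() / 3       # definition from the paper (section 2.4)
gamma = 1 / sigma**2      # i.e. 9 / D.min()**2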
I'm trying to adapt a batch gradient descent algorithm from a previous question to do stochastic gradient descent. My cost seems to get stuck pretty far from the minimum value (around 1750 in the example, when the minimum is around 1450), and once it reaches that value it just starts oscillating there. I also tried shuffling range(0, x.shape[0]-1) on every outer pass l, but it didn't make any difference. I expect oscillations around the optimal value, but this just seemed too far off, so I think there must be a mistake.
import numpy as np
y = np.asfarray([[400], [330], [369], [232], [540]])
x = np.asfarray([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1)), x), axis=1)
theta = np.asfarray([[0], [.5], [.5]])
fscale = np.sum(x, axis=0)
x /= fscale
alpha = .1
for l in range(1,100000):
    for i in range(0, x.shape[0]-1):
        h = np.dot(x, theta)
        gradient = ((h[i:i+1] - y[i:i+1]) * x[i:i+1]).T
        theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze() / fscale)
I have a specific analytical gradient I am using to calculate my cost f(x,y), and gradients dx and dy. It runs, but I can't tell if my gradient descent is broken. Should I plot my partial derivatives x and y?
import math
import numpy as np
import matplotlib.pyplot as plt

gamma = 0.00001  # learning rate
iterations = 10000  # steps
theta = np.array([0, 5])  # starting value
thetas = []
costs = []
# calculate the cost at any point
def cost(theta):
    x = theta[0]
    y = theta[1]
    return 100*x*math.exp(-0.5*x*x+0.5*x-0.5*y*y-y+math.pi)

def gradient(theta):
    x = theta[0]
    y = theta[1]
    dx = 100*math.exp(-0.5*x*x+0.5*x-0.0035*y*y-y+math.pi)*(1+x*(-x + 0.5))
    dy = 100*x*math.exp(-0.5*x*x+0.5*x-0.05*y*y-y+math.pi)*(-y-1)
    gradients = np.array([dx,dy])
    return gradients
# for 2 features
for step in range(iterations):
    theta = theta - gamma*gradient(theta)
    value = cost(theta)
    thetas.append(theta)
    costs.append(value)
thetas = np.array(thetas)
X = thetas[:,0]
Y = thetas[:,1]
Z = np.array(costs)
iterations = [num for num in range(iterations)]
plt.plot(Z)
plt.xlabel("num. iteration")
plt.ylabel("cost")
I strongly recommend you check whether or not your analytic gradient is working correctly by first evaluating it against a numerical gradient.
I.e. make sure that f'(x) ≈ (f(x+h) - f(x)) / h for some small h.
After that, make sure your updates are actually in the right direction by picking a point where you know x or y should decrease and then checking the sign of your gradient function output.
Of course make sure your goal is actually minimization vs maximization.
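A minimal sketch of such a check, reusing the cost and gradient functions from the question (theta0 is an arbitrary hypothetical test point):
h = 1e-6
theta0 = np.array([0.5, 1.0])   # hypothetical test point
numeric = np.zeros(2)
for k in range(2):
    step = np.zeros(2)
    step[k] = h
    # central difference: more accurate than the one-sided formula
    numeric[k] = (cost(theta0 + step) - cost(theta0 - step)) / (2 * h)
print("analytic:", gradient(theta0))
print("numeric: ", numeric)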
I am implementing gradient descent for regression using Newton's method, as explained in section 8.3 of the book Machine Learning: A Probabilistic Perspective (Murphy). I am working with two-dimensional data in this implementation. I am using the following notation.
x = input data points m*2
y = labelled outputs(m) corresponding to input data
H = Hessian matrix, defined as H = (2/m) * x^T * x
The Newton update is theta := theta - H^-1 * grad(J(theta)), where the loss function is defined as J(theta) = (1/m) * sum((x.dot(theta) - y)**2).
In my case, the gradient is a 2-element array and H is a 2x2 matrix.
Here is my Python implementation. However, this is not working, as the cost increases in each iteration.
def loss(x,y,theta):
    m,n = np.shape(x)
    cost_list = []
    for i in range(0,n):
        x_0 = x[:,i].reshape((m,1))
        predicted = np.dot(x_0, theta[i])
        error = predicted - y
        cost = np.sum(error ** 2) / m
        cost_list.append(cost)
    cost_list = np.array(cost_list).reshape((2,1))
    return cost_list

def NewtonMethod(x,y,theta,maxIterations):
    m,n = np.shape(x)
    xTrans = x.transpose()
    H = 2 * np.dot(xTrans,x) / m
    Hinv = np.linalg.inv(H)
    thetaPrev = np.zeros_like(theta)
    best_iter = maxIterations
    for i in range(0,maxIterations):
        cost = loss(x,y,theta)
        theta = theta - np.dot(Hinv,cost)
        if np.allclose(theta,thetaPrev,rtol=0.001,atol=0.001):
            break
        else:
            thetaPrev = theta
            best_iter = i
    return theta
Here are the sample values I used
import numpy as np
x = np.array([[-1.7, -1.5], [-1.0, -0.3], [1.7, 1.5], [-1.2, -0.7], [0.6, 0.1]])
y = np.array([ 0.3 , 0.07, -0.2, 0.07, 0.03 ])
theta = np.zeros(2)
NewtonMethod(x,y,theta,100)
Need help / suggestions to fix this problem.
Thanks
You are effectively using a step size of 1. Try reducing the step size and see if that helps. That is, instead of theta := theta - H^-1 * grad(J(theta)), do this: theta := theta - alpha * H^-1 * grad(J(theta)), with alpha smaller than 1.
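In terms of the code above, a sketch of the damped update inside the loop (alpha is a new step-size parameter, not in the original code):
alpha = 0.1  # hypothetical step size smaller than 1, tune as needed
theta = theta - alpha * np.dot(Hinv, cost)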