Gradient descent for linear regression in Python

import numpy as np

def computeCost(X, y, theta):
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))
    params = int(theta.ravel().shape[1])  # number of parameters (theta is a 1 x n matrix)
    cost = np.zeros(iters)
    for i in range(iters):
        err = (X * theta.T) - y
        for j in range(params):
            term = np.multiply(err, X[:, j])
            temp[0, j] = theta[0, j] - ((alpha / len(X)) * np.sum(term))
        theta = temp
        cost[i] = computeCost(X, y, theta)
    return theta, cost
Here is the code for the linear regression cost function and gradient descent that I found in a tutorial, but I'm not quite sure how it works.
First, I get how the computeCost code works: it's just the sum of squared errors divided by 2M, where M is the number of data points.
For the gradientDescent code, I just don't understand how it works in general. I know the formula for updating theta is something like
theta = theta - (learningRate) * (derivative of the cost function J). But I'm not sure where (alpha / len(X)) * np.sum(term) comes from on the line updating temp[0, j].
Please help me understand!

I'll break this down for you. In your gradientDescent function, you take in the predictor variables (X), the target variable (y), the weight matrix (theta), and two more parameters (alpha, iters), which are the training parameters. The function's job is to figure out how much each column of the predictor set X should be multiplied by before summing to produce the predicted values of the target variable y.

In the first line of the function you initialize a weight matrix called temp to zeros. This is the starting point of the final weight matrix (theta) that the function will output in the end. The params variable is the number of weights, i.e., the number of predictor variables. That line makes more sense in the context of neural networks, where you create abstractions of features. In linear regression, the weight matrix will mostly be a one-dimensional array. In a typical feed-forward neural network we take in, say, 5 features and convert them to maybe 4 or 6 or n features using a weight matrix, and so on. In linear regression we simply distill all the input features into one output feature. So theta is essentially a row vector (which is evident from the line temp[0, j] = ...), and params is the number of features. The cost array just stores the cost at each iteration.

Now on to the two loops. The outer for loop runs iters times; in a neural-network context this is usually called the number of epochs, and it is basically the number of times you want to train the model. In the line err = (X * theta.T) - y under that loop, we calculate the prediction error for each training example; the shape of err is (number of examples, 1). In the inner for loop we actually train the model, gradually updating the temp matrix one weight at a time.

Now the line term = np.multiply(err, X[:, j]): here we calculate the individual adjustment that should be made to each weight in the temp matrix. We define the cost as (y_predicted - y_actual)**2 / number_of_training_points, where y_predicted = X_1*w_1 + X_2*w_2 + X_3*w_3 + ... If we differentiate this cost with respect to a particular weight (say w_i), we get (y_predicted - y_actual) * X_i / number_of_training_points, where X_i is the column that w_i multiplies. So the term = ... line computes that differentiation part. We could then multiply term by the learning rate and subtract it from w_i, but as you might notice, term is an array. That is taken care of on the next line: we take the average of term (by summing it up and dividing by len(X)), scale by alpha, and subtract the result from the corresponding weight in the temp matrix. Once the weights have been updated and stored in the temp matrix, we replace the original theta with temp. We repeat this process iters times.

If, instead of writing it as ((alpha / len(X)) * np.sum(term)), you write it as (alpha * (np.sum(term) / len(X))), which you're allowed to do since scalar multiplication and division can be regrouped freely (associativity, if I remember the term correctly), then you are just multiplying alpha by the average error, since term has length len(X) anyway.
This means you subtract the learning rate (alpha) times the average error, which would be something like X[j] * theta (actual) - y (ideal); incidentally, that is also close enough to the derivative of (X*theta - y)^2.

Related

The initial Gradient in Gradient Descent is abysmal and wrong

I am building an input optimiser in Python (IDLE 3.9). It optimises a certain input "thetas" for a certain target output "targetRes" against a calculated output obtained from running a model with input "thetas" through a solver. The model is defined by a function FEA(thetas, fem); after running the solver, FEA returns the output.
The chosen optimisation algorithm is gradient descent. FEA's output is taken as the hypothesis function, and the target output is subtracted from it; the result is then squared to give the loss function. The gradient of the loss function is then determined using FDM (the finite-difference method), and the update step takes place. For now, I am only running the algorithm on thetas[0]. Below is the GD code:
targetRes = -0.1
thetas = [1000., 1., 1., 1., 1.]  # initial value of the unknown vector theta

def LF(thetas):
    return (FEA(thetas, fem) - targetRes) ** 2 / 2

def FDM(thetas, LF):
    fdm = []
    for i, theta in enumerate(thetas):
        h = 0.1
        if i == 0:
            print(h)
        thetas_p_h = []
        for t in thetas:
            thetas_p_h.append(t)
        thetas_p_h[i] += h
        thetas_m_h = []
        for t in thetas:
            thetas_m_h.append(t)
        thetas_m_h[i] -= h
        grad = (LF(thetas_p_h) - LF(thetas_m_h)) / (2 * h)
        fdm.append(grad)
    return fdm

def GD(thetas, LF):
    tol = 0.000001
    alpha = 10000000
    Nmax = 1000
    for n in range(Nmax):
        gradient = FDM(thetas, LF)
        thetas_new = []
        for gradient_item, theta_item in zip(gradient, thetas):
            t_new = theta_item - (alpha * gradient_item)
            thetas_new.append(t_new)
        print(thetas, f'gradient = {gradient}', LF(thetas), thetas_new, LF(thetas_new))
        if tol >= abs(LF(thetas_new) - LF(thetas)):
            print(f"solution converged in {n} iterations, theta = {thetas_new}, LF(theta) = {LF(thetas_new)}, FEA(theta) = {FEA(thetas_new, fem)}")
            return thetas_new
        thetas = thetas_new
    else:  # for-else: runs only if the loop exhausts without returning
        print(f"reached max iterations, Nmax = {Nmax}")
        return None

GD(thetas, LF)
As you can see, the gradient descent algorithm I am using is different from the linear regression type, because there are no features to evaluate, just labels (y). Unfortunately I am not allowed to provide the solver code, and most likely not allowed to provide the model code either, defined in FEA.
In the current example, the calculated initial output is -0.070309. My issues are:
The gradient is minute at the first update iteration, 2.078369999999961e-06, and the final update iteration's gradient is 1.834250000000102e-08. In fact, the first gradient value is most likely wrong: I attempted to calculate it by hand and got a value of around 5e-6, which is still tiny.
I am using a gigantic learning rate, because the algorithm is horribly slow without it.
The algorithm converges to the target output only for certain input values. In another case, with different values of thetas, the algorithm produced a loss function on the order of 10^7 at the zeroth iteration, which then dropped right down to zero and converged at the first iteration, which is unrealistic. In both cases I use a large learning rate, and both converge to the target output. For some other inputs the algorithm does not converge to the target output, and others lead to a negative value of thetas[0], which causes the solver to fail.
Also, please disregard the fact that I am not using any external libraries; that is intentional. So, without trying to run the code, any observations? Does anyone see anything obvious that I'm missing? Could it be due to the orders of magnitude of the inputs? What are your thoughts? (There are no issues with the solver or the model function FEA; both have been repeatedly verified to work fine.)
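One observation worth checking, sketched below with a hypothetical quadratic standing in for FEA (since the real solver isn't available): a fixed step h = 0.1 is relatively tiny against thetas[0] = 1000 but large against the unit-scale parameters. A common remedy for central differences is to scale the step to each parameter's magnitude; all names and values here are illustrative.

```python
import numpy as np

def loss(thetas):
    # Hypothetical stand-in for LF: a smooth function mixing scales ~1000 and ~1
    return 0.5 * ((thetas[0] / 1000.0 - 1.0) ** 2 + (thetas[1] - 1.0) ** 2)

def analytic_grad(thetas):
    # Exact gradient of the stand-in loss, for comparison
    return np.array([(thetas[0] / 1000.0 - 1.0) / 1000.0, thetas[1] - 1.0])

def fdm_relative(thetas, f, base=1e-6):
    """Central differences with a step proportional to each parameter's size."""
    thetas = np.asarray(thetas, dtype=float)
    grad = np.zeros_like(thetas)
    for i in range(len(thetas)):
        h = base * max(1.0, abs(thetas[i]))  # relative step, not a fixed 0.1
        tp, tm = thetas.copy(), thetas.copy()
        tp[i] += h
        tm[i] -= h
        grad[i] = (f(tp) - f(tm)) / (2 * h)
    return grad

numerical = fdm_relative([1500.0, 0.5], loss)
exact = analytic_grad(np.array([1500.0, 0.5]))
```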

Checking Neural Network Gradient with Finite Difference Methods Doesn't Work

After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed my cost function produces correct costs for both regularized and unregularized inputs. Here's the cost function:
def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):], (num_labels, (hidden_layer_size + 1)))
    if lambda_ is None:
        lambda_ = 0
    # grab number of observations
    m = X.shape[0]
    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()
    ones = np.ones((m, 1))
    X = np.hstack((ones, X))
    # layer 1
    a1 = X
    z2 = Theta1 @ a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2 @ a2.T
    # layer 3
    a3 = sigmoid(z3)
    reg_term = (lambda_ / (2 * m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2, Theta2))) - np.subtract((Theta1[:, 0].T @ Theta1[:, 0]), (Theta2[:, 0].T @ Theta2[:, 0])))
    cost = (1 / m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1 - a3).T * (1 - y_mtx))) + reg_term
    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2)
    d2 = Theta2[:, 1:].T @ d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1
    Delta1 = d2 @ a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2
    Delta2 = d3 @ a2
    Delta2 /= m
    reg_term1 = (lambda_ / m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:, 1:], axis=1)
    reg_term2 = (lambda_ / m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:, 1:], axis=1)
    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2
    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
    return cost, grad
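The code above assumes sigmoid and sigmoidGradient helpers that aren't shown in the post; minimal versions consistent with how they are used would look like this:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidGradient(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1 - g)
```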
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Input: Regularization parameter, lambda, as int or float.
    Output: Analytical gradients produced by backprop code and the numerical gradients (computed
            using computeNumericalGradient). These two gradient computations should result in
            very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)
    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())
    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)
    # examine the two gradient computations; the two columns should be very similar.
    print('The columns below should be very similar.\n')
    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))
    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n'
          'the relative difference will be small (less than 1e-9). \n'
          '\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in
    incoming connections and fan_out outgoing connections using a fixed
    strategy.
    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
           of incoming connections for the same layer as int.
    Output: Weight matrix, W, of size (fan_out, 1 + fan_in), where the first column of W handles the "bias" terms
    """
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which is useful for debugging
    W = np.sin(range(1, fan_out * (1 + fan_in) + 1)) / 10
    return W.reshape(fan_out, fan_in + 1)

def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences"
    and provides a numerical estimate of the gradient (i.e.,
    gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Inputs: Cost function, J, as computed by nnCost; parameter vector, nn_params.
    Output: Gradient vector using finite differences. Per Dr. Ng,
            'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
            J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
            be the (approximately) the partial derivative of J with respect
            to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set the perturbation vector
        perturb[i] = e
        # run the cost fxn with noise added to and subtracted from parameter i
        cost1, grad1 = J(nn_params - perturb)
        cost2, grad2 = J(nn_params + perturb)
        # record the central difference of the cost outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2 * e)
        perturb[i] = 0
    return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to move together by changing that option.
One thought: I think your perturbation is a little large, being 1e-4. For double precision floating point numbers, it should be more like 1e-8, i.e., the root of the machine precision (or are you working with single precision?!).
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating point computations in numpy are not deterministic, as you seem to have found out. The noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing and what are you expecting?
All of the following figured into the solution to my problem. For those translating MATLAB code to Python, whether from Andrew Ng's Coursera Machine Learning course or not, these are things everyone should know.

1. MATLAB does everything in FORTRAN (column-major) order; Python does everything in C (row-major) order. This affects how vectors are populated and, thus, your results. You should always use FORTRAN order if you want your answers to match what you did in MATLAB. See the NumPy docs.

2. Getting your vectors in FORTRAN order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). If you are using .ravel(), you can achieve the same thing by transposing the vector and then applying .ravel(), like so: X.T.ravel().

3. Speaking of .ravel(): the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this StackOverflow Q&A and the docs on .ravel(), which link to .flatten().

4. If you're translating your code from a MATLAB live script into a Jupyter notebook or Google Colab, you must police your namespace. On one occasion, I found that the variable I thought was being passed was not actually the variable being passed. Why? Jupyter and Colab notebooks have a lot of global variables that one would never write ordinarily.

5. There is a better metric for evaluating the differences between numerical and analytical gradients: the relative error, np.abs(numerical - analytical) / (np.abs(numerical) + np.abs(analytical)). Read about it in the CS231n notes. Also, consider the accepted answer above.
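A minimal demonstration of the ordering point above: the same matrix raveled in C order versus FORTRAN order yields differently arranged vectors, and X.T.ravel() reproduces the FORTRAN result.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

c_order = A.ravel()            # row-major (C order)
f_order = A.ravel(order='F')   # column-major (FORTRAN order), as MATLAB does
via_transpose = A.T.ravel()    # transposing first gives the same result
```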

How to write dice-loss backpropogation with numpy

I am trying to write a dice loss function myself. Here is the forward pass I wrote, but I did not understand how to calculate the backprop. I tried to write some, but it's not working. Or does dice loss not need backprop at all?

alpha = 0.5
beta = 0.5
tp = np.sum(pred * label)
fn = np.sum((1 - pred) * label)
fp = np.sum(pred * (1 - label))
dice = tp / (tp + alpha * fn + beta * fp)
I am not sure I would call that a forward pass. How do you get pred?
Typically you need to write down the steps leading to pred; then you compute your loss as you did. This defines a computational graph, and from there the backward pass (backpropagation) can start. You calculate the gradient starting from the end of the computational graph and proceed backward to get the gradient of the loss with respect to the weights.
I wrote an introduction to backpropagation in a blog post (https://www.qwertee.io/blog/an-introduction-to-backpropagation) , and I suppose you should find more details about how to do it.
You just need to use calculus and the chain rule to solve this.
The dice coefficient is defined as D = 2|X ∩ Y| / (|X| + |Y|), where X is your pred and Y is your label. (With alpha = beta = 0.5, the expression in your question reduces to exactly this, since tp + 0.5*fn + 0.5*fp = 0.5*(sum(pred) + sum(label)).)
For matrices X and Y of size MxN, we can write this as D = 2 * Σ_ij X_ij Y_ij / (Σ X + Σ Y).
Apply the quotient rule for an arbitrary entry (i, j) of X:
∂D/∂X_ij = (2 * Y_ij * (Σ X + Σ Y) - 2 * Σ XY) / (Σ X + Σ Y)^2
We can now divide the numerator and denominator by (Σ X + Σ Y), and that gives us a pretty neat solution:
∂D/∂X_ij = (2 * Y_ij - D) / (Σ X + Σ Y)
In Python:
dX = (2*Y-dice)/(np.sum(X)+np.sum(Y))
And if you add a smooth term in the numerator and denominator of your dice:
dX = (2*Y-dice)/(np.sum(X)+np.sum(Y)+eps)
Also if you save the Dice, |X| and |Y| variables in your forward calculation, you don't have to calculate them again in the backwards pass.
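To gain confidence in that result, you can check the formula against finite differences on random data. The script below is a sanity sketch: it uses the standard dice 2·Σ(XY)/(ΣX + ΣY), which is what the question's expression reduces to when alpha = beta = 0.5, and all names in it are mine.

```python
import numpy as np

def dice(X, Y):
    # Standard dice score; equals tp / (tp + 0.5*fn + 0.5*fp) from the question
    return 2 * np.sum(X * Y) / (np.sum(X) + np.sum(Y))

rng = np.random.default_rng(0)
X = rng.random((4, 5))                        # soft predictions
Y = (rng.random((4, 5)) > 0.5).astype(float)  # binary labels

# Analytic gradient of the dice *score* (negate it for a 1 - dice loss)
d = dice(X, Y)
dX = (2 * Y - d) / (np.sum(X) + np.sum(Y))

# Numerical gradient via central differences, entry by entry
eps = 1e-6
num = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        num[i, j] = (dice(Xp, Y) - dice(Xm, Y)) / (2 * eps)
```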

Closed Form Ridge Regression

I am having trouble understanding the output of my function implementing multiple ridge regression. I am doing this from scratch in Python using the closed form of the method, shown below:

w = (X^T X + lambda * I)^(-1) X^T y
I have a training set X that is 100 rows x 10 columns and a vector y that is 100x1.
My attempt is as follows:
def ridgeRegression(xMatrix, yVector, lambdaRange):
    wList = []
    for i in range(1, lambdaRange + 1):
        lambVal = i
        # compute the inner term (X.T X + lambda I)
        xTranspose = np.transpose(xMatrix)
        xTx = xTranspose @ xMatrix
        lamb_I = lambVal * np.eye(xTx.shape[0])
        # invert the inner term, i.e. (inner)**(-1)
        inner_matInv = np.linalg.inv(xTx + lamb_I)
        # compute the outer term (X.T y)
        outer_xTy = np.dot(xTranspose, yVector)
        # multiply together
        w = inner_matInv @ outer_xTy
        wList.append(w)
    print(wList)
For testing, I am running it with the first 5 lambda values.
wList becomes 5 numpy.arrays each of length 10 (I'm assuming for the 10 coefficients).
Here is the first of those 5 arrays:
array([ 0.29686755, 1.48420319, 0.36388528, 0.70324668, -0.51604451,
2.39045735, 1.45295857, 2.21437745, 0.98222546, 0.86124358])
My questions, for clarification:
Shouldn't there be 11 coefficients, (1 for the y-intercept + 10 slopes)?
How do I get the Minimum Square Error from this computation?
What comes next if I wanted to plot this line?
I think I am just really confused as to what I'm looking at, since I'm still working on my linear-algebra.
Thanks!
First, I would modify your ridge regression to look like the following:
import numpy as np

def ridgeRegression(X, y, lambdaRange):
    wList = []
    # Get normal form of `X`
    A = X.T @ X
    # Get identity matrix
    I = np.eye(A.shape[0])
    # Get right-hand side
    c = X.T @ y
    for lambVal in range(1, lambdaRange + 1):
        # Set up equations Bw = c
        lamb_I = lambVal * I
        B = A + lamb_I
        # Solve for w
        w = np.linalg.solve(B, c)
        wList.append(w)
    return wList
Notice that I replaced your inv call, which explicitly computes the matrix inverse, with an implicit solve. This is much more numerically stable, which is an important consideration especially for these types of problems.
I've also moved the A = X.T @ X computation, the identity matrix I, and the right-hand-side vector c = X.T @ y out of the loop; these don't change within the loop and are relatively expensive to compute.
As was pointed out by @qwr, the number of columns of X determines the number of coefficients you have. You have not described your model, so it's not clear how the underlying domain, x, is structured into X.
Traditionally, one might use polynomial regression, in which case X is the Vandermonde Matrix. In that case, the first coefficient would be associated with the y-intercept. However, based on the context of your question, you seem to be interested in multivariate linear regression. In any case, the model needs to be clearly defined. Once it is, then the returned weights may be used to further analyze your data.
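For instance, if the model were a polynomial fit, X would be built as a Vandermonde matrix, with a column of ones carrying the intercept; a quick sketch:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])      # underlying domain
X = np.vander(x, N=3, increasing=True)  # columns: 1, x, x**2
```

With increasing=True the first column is x**0, so the first fitted coefficient would be the y-intercept.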
Typically to make notation more compact, the matrix X contains a column of ones for an intercept, so if you have p predictors, the matrix is dimensions n by p+1. See Wikipedia article on linear regression for an example.
To compute in-sample MSE, use the definition for MSE: the average of squared residuals. To compute generalization error, you need cross-validation.
Also, you shouldn't take lambVal as an integer. It can be small (close to 0) if the aim is just to avoid numerical error when xTx is ill-conditioned.
I would advise you to use a logarithmic range instead of a linear one, starting from 0.001 and going up to 100 or more if you want. For instance, you can change your code to this:

powerMin = -3
powerMax = 3
for i in range(powerMin, powerMax):
    lambVal = 10**i
    print(lambVal)
And then you can try a smaller range or a linear range once you figure out what is the correct order of lambVal with your data from cross-validation.
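Putting the MSE advice above into code: once a weight vector comes back from the ridge solve, score it on the training data. The synthetic data and names below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, p))])  # intercept column + p predictors
true_w = rng.standard_normal(p + 1)
y = X @ true_w + 0.1 * rng.standard_normal(n)                  # small additive noise

lamb = 0.01  # small, non-integer lambda
w = np.linalg.solve(X.T @ X + lamb * np.eye(p + 1), X.T @ y)

residuals = y - X @ w
mse = np.mean(residuals ** 2)  # in-sample mean squared error
```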

Batch Gradient Descent for Logistic Regression

I've been following Andrew Ng's CS229 machine learning course and am now covering logistic regression. The goal is to maximize the log-likelihood function and find the optimal values of theta to do so. The lecture notes are at http://cs229.stanford.edu/notes/cs229-notes1.ps (pages 16-19). The code below was shown on the course homepage (in MATLAB, though; I converted it to Python).
I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for an introductory machine learning course). The data has two features, which are two scores on two exams. The output is 1 if the student received admission and 0 if the student did not. I have shown all of the code below. The following code causes the likelihood function to converge to a maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. Using these values, when I test a training example like [1, 30.28671077, 43.89499752], which should give a value of 0 as output, I obtain 0.576, which makes no sense to me. If I test the hypothesis function with input [1, 10, 10], I obtain 0.515, which once again makes no sense. These values should correspond to a lower probability. This has me quite confused.
import numpy as np
import sig as s  # local module providing the sigmoid function

def batchlogreg(X, y):
    max_iterations = 800
    alpha = 0.00001
    (m, n) = np.shape(X)
    X = np.insert(X, 0, 1, 1)
    theta = np.array([0] * (n + 1), 'float')
    ll = np.array([0] * max_iterations, 'float')
    for i in range(max_iterations):
        hx = s.sigmoid(np.dot(X, theta))
        d = y - hx
        theta = theta + alpha * np.dot(np.transpose(X), d)
        ll[i] = sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))
    return (theta, ll)
Note that the sigmoid function has:
sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5
Since you get all probabilities above 0.5, this suggests that you never make X * theta negative, or that you do, but your learning rate is too small to make it matter.
for i in range(max_iterations):
    hx = s.sigmoid(np.dot(X, theta))  # this will probably be > 0.5 initially
    d = y - hx                        # then this will be "very" negative when y is 0
    theta = theta + alpha * np.dot(np.transpose(X), d)  # (1)
    ll[i] = sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))
The problem is most likely at (1). The dot product will be very negative, but your alpha is very small and will negate its effect. So theta will never decrease enough to correctly classify the labels that are 0.
Positive instances are then only barely correctly classified for the same reason: your algorithm does not discover a reasonable hypothesis with your number of iterations and learning rate.
Possible solution: increase alpha and / or the number of iterations, or use momentum.
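A momentum variant is a small change to the update loop. The sketch below is self-contained (it generates its own synthetic data and defines its own sigmoid; the alpha and beta values are illustrative, not tuned for the asker's data set):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic logistic-regression data
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(100) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(3)
v = np.zeros(3)              # velocity accumulator
alpha, beta = 0.001, 0.9     # learning rate and momentum coefficient
for _ in range(500):
    grad = X.T @ (y - sigmoid(X @ theta))  # gradient of the log-likelihood
    v = beta * v + alpha * grad            # momentum: accumulate past gradients
    theta = theta + v                      # ascend the log-likelihood

hx = sigmoid(X @ theta)
ll = np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))
```

The velocity term lets consistent gradient directions build up speed, which helps when a small alpha alone would make progress painfully slow.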
It sounds like you could be confusing probabilities with assignments.
The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.
So, for the example you gave, the decisions would both be 1 (which means the model is wrong for the first example where it should be 0).
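In code, that decision rule is one line. Using the theta and the two example inputs from the question (0.5 as the threshold):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(X, theta, threshold=0.5):
    """Map logistic-regression probabilities to hard 0/1 labels."""
    probs = sigmoid(X @ theta)
    return (probs >= threshold).astype(int)

theta = np.array([-0.05560301, 0.01081111, 0.00088362])
X = np.array([[1.0, 30.28671077, 43.89499752],
              [1.0, 10.0, 10.0]])
labels = predict_labels(X, theta)  # both probabilities are just above 0.5
```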
I came to the same question and found the reason.
Normalize X first, or set a scale-comparable intercept like 50.
Otherwise the contours of the cost function are too "narrow": a big alpha makes the updates overshoot, and a small alpha fails to make progress.
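A sketch of that normalization step: standardize each raw feature to zero mean and unit variance before adding the intercept column (the sample scores below are made up):

```python
import numpy as np

def standardize(X):
    """Return zero-mean, unit-variance columns plus the statistics used."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

raw = np.array([[30.3, 43.9],
                [95.9, 38.2],
                [75.0, 30.6],
                [60.2, 86.3]])
X_norm, mu, sigma = standardize(raw)
X = np.insert(X_norm, 0, 1, axis=1)  # add the intercept column afterwards
```

Remember to apply the same mu and sigma to any new inputs before predicting with the fitted theta.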
