I am trying to implement gradient descent on a dataset. Even though I have tried everything, I couldn't make it work, so I created a test case: I run my code on random data and try to debug.
More specifically, I generate random vectors with components between 0 and 1 and random labels for these vectors, and try to over-fit the training data.
However, my weight vector gets bigger and bigger in each iteration, and then I get infinities, so I do not actually learn anything. Here is my code:
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(n)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.01
    w = np.zeros(len(vectors[0])+1)
    for i in range(maxIterations):
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            deltaw += ( labels[i] - np.dot(w, temp) ) * temp
        w = w + ( stepSize * (-1 * deltaw) )
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print(w)
I am using LMS as the loss function, so in every iteration my update is

    w^{i+1} = w^i - R \, \nabla E(w^i)

where w^i is the i-th weight vector, R is the stepSize, and E(w^i) is the loss function. Here is my loss function (LMS):

    E(w) = \frac{1}{2} \sum_j \left( y_j - w \cdot x_j \right)^2

and here is how I derived it with respect to w:

    \nabla E(w) = - \sum_j \left( y_j - w \cdot x_j \right) x_j
Now, my questions are:
Should I expect good results in this random scenario using gradient descent? (What are the theoretical bounds?)
If yes, what is the bug in my implementation?
PS: I tried several other maxIterations and stepSize parameters. It still doesn't work.
PS2: This is the best way I could ask the question here. Sorry if the question is too specific, but it is driving me crazy and I really want to understand the problem.
Your code has a couple of faults:
In getVectors(), you do not actually use the input variable m.
In GDLearn(), you have a double loop but use the same variable i as the loop variable in both loops. (I guess the logic is still right, but it's confusing.)
The prediction error (labels[i] - np.dot(w, temp)) has the wrong sign.
Step size does matter. With 0.01 as the step size, the cost increases in each iteration; changing it to 0.001 solves the problem. (A quick way to sanity-check the step size is sketched right after this list.)
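As a rough sanity check (my addition, not part of the original answer): for full-batch gradient descent on a summed squared error, the iterates only converge if the step size stays below 2 divided by the largest eigenvalue of X^T X. A minimal sketch, assuming the same setup as above (100 uniform vectors of dimension 30 plus the constant -1 feature that GDLearn appends):

import numpy as np

# Build a design matrix like the one in the question: 100 random rows in [0,1)^30
# plus the constant -1 feature appended to every vector.
X = np.hstack([np.random.uniform(0, 1, (100, 30)), -np.ones((100, 1))])

# For the update w <- w - stepSize * X.T @ (X @ w - y), convergence requires
# stepSize < 2 / lambda_max(X^T X).
lam_max = np.linalg.eigvalsh(X.T @ X).max()
print('largest eigenvalue of X^T X:', lam_max)
print('step size must stay below:', 2.0 / lam_max)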
Here is my revised code based on your original code.
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(m)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.001
    w = np.zeros(len(vectors[0])+1)
    for iter in range(maxIterations):
        cost = 0
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            prediction_error = np.dot(w, temp) - labels[i]
            deltaw += prediction_error * temp
            cost += prediction_error**2
        w = w - stepSize * deltaw
        print('cost at', iter, '=', cost)
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print(w)
Running result: you can see the cost decreasing in each iteration, but with diminishing returns.
cost at 0 = 100.0
cost at 1 = 99.4114482617
cost at 2 = 98.8476022685
cost at 3 = 98.2977744556
cost at 4 = 97.7612851154
cost at 5 = 97.2377571222
cost at 6 = 96.7268325883
cost at 7 = 96.2281642899
cost at 8 = 95.7414151147
cost at 9 = 95.2662577529
cost at 10 = 94.8023744037
......
cost at 90 = 77.367904046
cost at 91 = 77.2744249433
cost at 92 = 77.1823702888
cost at 93 = 77.0917090883
cost at 94 = 77.0024111475
cost at 95 = 76.9144470493
cost at 96 = 76.8277881325
cost at 97 = 76.7424064707
cost at 98 = 76.6582748518
cost at 99 = 76.5753667579
[ 0.16232142 -0.2425511 0.35740632 0.22548442 0.03963853 0.19595213
0.20080207 -0.3921798 -0.0238925 0.13097533 -0.1148932 -0.10077534
0.00307595 -0.30111942 -0.17924479 -0.03838637 -0.23938181 0.1384443
0.22929163 -0.0132466 0.03325976 -0.31489526 0.17468025 0.01351012
-0.25926117 0.09444201 0.07637793 -0.05940019 0.20961315 0.08491858
0.07438357]
I'm trying to do my own neural network implementation from scratch, but I'm having some problems: one of the terms grows exponentially as the iterations advance, which makes the accuracy quickly reach a plateau and prevents gradient descent from finding any optimal solution.
I derived the equations I'm using on my own while studying some of the resources available online; I chose to write my own equations so I could understand them better:
    \delta^L = (A^L - Y) \odot \sigma'(Z^L)    (1)

where L is the last layer, Y is the target, ⊙ is the Hadamard product, A^L is the values of the L-th layer after the activation function, and Z^L is the values before the activation function is applied.

    \delta^i = \left( (W^{i+1})^T \delta^{i+1} \right) \odot \sigma'(Z^i)    (2)

where W^{i+1} is the matrix of weights of the (i+1)-th layer, with the number of rows equal to the number of neurons in the (i+1)-th layer and the number of columns equal to the number of neurons in the i-th layer.

    \frac{\partial C}{\partial W^i} = \delta^i (A^{i-1})^T    (3)

    \frac{\partial C}{\partial b^i} = \delta^i \mathbf{1} \quad \text{(the deltas summed over the samples)}    (4)
I struggled a lot with these, but I was finally able to make them work and keep all the matrix dimensions consistent. Now, implementing these equations in my code:
def back_prop(self):
    # equation (1): delta for the output layer, (A^L - Y) elementwise-multiplied by sigma'(Z^L)
    delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
    for i, layer in reversed(list(enumerate(self.layers))):
        # equations (3) and (4): gradients of the weights and biases of layer i
        dw = np.matmul(delta, self.a_record[i].T)
        db = np.matmul(delta, np.ones((self.n, 1)))
        self.dw_record.append(dw)
        self.db_record.append(db)
        if i > 0:
            # equation (2): propagate delta back to the previous layer
            delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
a_record and z_record are lists that store the arrays corresponding to each layer; the same goes for the dw and db records. I wrote it like this so it is as close as possible to my equations.
The problem comes once I start to measure performance on any dataset: it remains stagnant after just a couple of iterations. I get this warning; since it's not an error, it doesn't stop the script.
RuntimeWarning: overflow encountered in multiply delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
When I analyze the values coming out of delta, I find that it grows way too fast: it starts with small values but reaches magnitudes around 1e+200 and then just returns infinity, which turns the next delta into a bunch of np.nan and makes all the weights and biases stop changing.
I'm fairly certain the equations are equivalent to the ones I found for other neural networks, so I don't really know why this is happening. I can't imagine the values of any randomly generated dataset being so insanely big that they overflow after just a couple of iterations.
I will post the whole class I made:
class Network():
    def __init__(self, input, target, layers, alpha = 0.1, iter= 1000):
        self.input = np.array(input).T
        self.y_true = target
        #self.target = np.array(one_hot(target)).T
        self.target = np.array(target).T
        self.layers = layers
        self.alpha = alpha
        self.iter = iter
        self.n = len(input)
        self.w = []
        self.b = []
        for i, layer in enumerate(layers):
            if i == 0:
                self.w.append(np.random.rand(layer.neurons, len(self.input)) - 0.5)
            else:
                self.w.append(np.random.rand(layer.neurons, layers[i - 1].neurons) - 0.5)
            self.b.append(np.random.rand(layer.neurons, 1) - 0.5)
        self.a_record = []
        self.z_record = []
        self.dw_record = []
        self.db_record = []

    def forward_prop(self):
        a = self.input
        self.a_record.append(a)
        for i, layer in enumerate(self.layers):
            z = np.matmul(self.w[i], a) + self.b[i].reshape(-1, 1)
            a = layer.func(z)
            self.z_record.append(z)
            self.a_record.append(a)
        self.output = a

    def back_prop(self):
        delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
        print(delta)
        for i, layer in reversed(list(enumerate(self.layers))):
            dw = np.matmul(delta, self.a_record[i].T)
            db = np.matmul(delta, np.ones((self.n, 1)))
            self.dw_record.append(dw)
            self.db_record.append(db)
            print(self.a_record[i].T)
            print("------------------------------------------")
            if i > 0:
                delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))

    def gradient_desc(self):
        w_copy = self.w
        for i, (w, dw) in enumerate(zip(w_copy, self.dw_record[::-1])):
            self.w[i] = w - self.alpha*dw
        b_copy = self.b
        for i, (b, db) in enumerate(zip(b_copy, self.db_record[::-1])):
            self.b[i] = b - self.alpha*db

    def fit(self):
        for i in range(self.iter):
            self.forward_prop()
            self.back_prop()
            self.gradient_desc()
            if i % 40 == 0:
                print(f"Iteration: {i + 1} / {self.iter} =====================================")
                print(f"Accuracy: {get_accuracy(self.output, self.y_true)}")
I've been trying for a couple of days now to figure this out, but I can't seem to find the error or a workaround to prevent the overflow. Maybe the equations are wrong from the start? Any help would be greatly appreciated; thank you in advance.
I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0, 1.
I worked out the equations for backpropagation and found that, despite the network being so simple, my code does not converge on the XOR data.
Let W be the 3x3 matrix of weights connecting the input and hidden layer, and w be the 1x3 matrix that connects the hidden to output layer. Here are some helper functions for my method
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
this just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the Mean Squared Error for a bunch of given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i=0
    while(i < epochs):
        prevobj = obj
        dellw = np.zeros((1,3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s,u,v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
            dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=",obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and the derivatives I took, does not converge to anything. In fact the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe the problem lies with the W matrix gradient, but I do not know exactly what it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that I could be consistent with the matrices I'm dealing with.
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you are looping over it. Likewise, you should know that np.matmul() and np.dot() both correspond to the @ operator for 2-D arrays, while * is element-wise multiplication. It is unclear how you are using the derivative: you have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, and they should be derived all at once from your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) solves for a fixed set of weights given a random variable, in stochastic gradient descent we do the opposite: we fix the random variable to our data and optimize the weights. Hence, your functions should only have your weights as "free variables"; everything else is fixed. It is important to follow what is fixed and what is free to update, and your code does not reflect that you know which is which.
SGD algorithm outline:
Random params.
Update params by moving them a small percentage in the direction of steepest descent.
Run step (2) for a specified amount of time.
Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data).
import numpy as np

# Data
X = np.random.random((100,)) # Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,)) # Linear model + Noise

# Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p : p[0]*X+p[1]
dF = lambda p : np.array([X,np.ones(X.shape)])
MSE = lambda p : (1/Y.shape[0])*((Y-F(p))**2).sum(0)
dMSE = lambda p : (1/Y.shape[0])*(-2*(Y-F(p))*dF(p)).sum(1)

# SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0,0.0])
for i in range(epochs):
    params -= lr*dMSE(params)

print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient occurs and exactly how it is calculated. Note also, in case it wasn't clear, that the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module; hence, SGD is a clearly useless way to optimize two variables.
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
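For comparison (my note, not part of the original answer), the result object returned by linregress exposes the fitted coefficients directly, so you can line them up against the SGD params printed above:

result = linregress(X, Y)
print(result.slope, result.intercept)  # slope and intercept from ordinary least squares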
I think I figured it out: in my code I was not summing the hidden node weight derivatives and instead was assigning them at every loop iteration. The correct version would be as follows (note that the accumulators also need to be initialized to zero before the loop, as shown):
# Initialize the accumulators before the loop (each has the same shape as x)
dells = np.zeros(len(traindata[0]))
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s,u,v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
I'm trying to write code that returns the parameters for ridge regression using gradient descent. Ridge regression is defined as

    L(w, b) = \sum_i \left( y_i - w \cdot x_i \right)^2 + \lambda \, \lVert w \rVert^2

where L is the loss (or cost) function, w are the parameters of the loss function (which assimilates b), the x_i are the data points, the y_i are the labels for each vector x_i, lambda is a regularization constant, and b is the intercept parameter (which is assimilated into w). So L(w, b) is a number.
The gradient descent algorithm that I should implement looks like this:

    w_{t+1} = w_t - \eta_t \, \nabla L(w_t)

where ∇L is the gradient of L with respect to w, η is a step size, and t is the time or iteration counter.
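For reference (my own working, not shown in the original post), differentiating this loss with respect to w gives the gradient that the grad line in the code below computes:

    \nabla L(w) = -2 \sum_i \left( y_i - w \cdot x_i \right) x_i + 2 \lambda w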
My code:
def ridge_regression_GD(x,y,C):
    x=np.insert(x,0,1,axis=1) # adding a feature 1 to x at beginning, n x (d+1)
    w=np.zeros(len(x[0,:])) # d+1
    t=0
    eta=1
    summ = np.zeros(1)
    grad = np.zeros(1)
    losses = np.array([0])
    loss_stry = 0
    while eta > 2**-30:
        for i in range(0,len(y)): # here we calculate the summation for all rows for loss and gradient
            summ=summ+((y[i,]-np.dot(w,x[i,]))*x[i,])
            loss_stry=loss_stry+((y[i,]-np.dot(w,x[i,]))**2)
        losses=np.insert(losses,len(losses),loss_stry+(C*np.dot(w,w)))
        grad=((-2)*summ)+(np.dot((2*C),w))
        eta=eta/2
        w=w-(eta*grad)
        t+=1
        summ = np.zeros(1)
        loss_stry = 0
    b=w[0]
    w=w[1:]
    return w,b,losses
The output should be the intercept parameter b, the vector w and the loss in each iteration, losses.
My problem is that when I run the code I get increasing values for w and for the losses, both in the order of 10^13.
Would really appreciate if you could help me out. If you need any more information or clarification just ask for it.
NOTE: This post was deleted from Cross Validated forum. If there's a better forum to post it please let me know.
After checking your code, it turns out your implementation of ridge regression is correct. The increasing values of w (which lead to the increasing losses) are due to extreme, unstable parameter updates (i.e. abs(eta*grad) is too big). I adjusted the learning rate and the weight decay rate to an appropriate range and changed the way you decay the learning rate, and then everything works as expected:
import numpy as np

sample_num = 100
x_dim = 10
x = np.random.rand(sample_num, x_dim)
w_tar = np.random.rand(x_dim)
b_tar = np.random.rand(1)[0]
y = np.matmul(x, np.transpose([w_tar])) + b_tar
C = 1e-6

def ridge_regression_GD(x,y,C):
    x = np.insert(x,0,1,axis=1) # adding a feature 1 to x at beginning, n x (d+1)
    x_len = len(x[0,:])
    w = np.zeros(x_len) # d+1
    t = 0
    eta = 3e-3
    summ = np.zeros(x_len)
    grad = np.zeros(x_len)
    losses = np.array([0])
    loss_stry = 0
    for i in range(50):
        for i in range(len(y)): # here we calculate the summation for all rows for loss and gradient
            summ = summ + (y[i,] - np.dot(w, x[i,])) * x[i,]
            loss_stry += (y[i,] - np.dot(w, x[i,]))**2
        losses = np.insert(losses, len(losses), loss_stry + C * np.dot(w, w))
        grad = -2 * summ + np.dot(2 * C,w)
        w -= eta * grad
        eta *= 0.9
        t += 1
        summ = np.zeros(1)
        loss_stry = 0
    return w[1:], w[0], losses

w, b, losses = ridge_regression_GD(x, y, C)
print("losses: ", losses)
print("b: ", b)
print("b_tar: ", b_tar)
print("w: ", w)
print("w_tar", w_tar)

x_pre = np.random.rand(3, x_dim)
y_tar = np.matmul(x_pre, np.transpose([w_tar])) + b_tar
y_pre = np.matmul(x_pre, np.transpose([w])) + b
print("y_pre: ", y_pre)
print("y_tar: ", y_tar)
Outputs:
losses: [ 0 1888 2450 2098 1128 354 59 5 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1]
b: 1.170527138363387
b_tar: 0.894306608050021
w: [0.7625987 0.6027163 0.58350218 0.49854847 0.52451963 0.59963663
0.65156702 0.61188389 0.74257133 0.67164963]
w_tar [0.82757802 0.76593551 0.74074476 0.37049698 0.40177269 0.60734677
0.72304859 0.65733725 0.91989305 0.79020028]
y_pre: [[3.44989377]
[4.77838804]
[3.53541958]]
y_tar: [[3.32865041]
[4.74528037]
[3.42093559]]
As you can see from the change in losses in the output, the learning rate eta = 3e-3 is still a bit too large, so the loss goes up during the first few training episodes, but it starts to drop once the learning rate decays to an appropriate value.
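If you want to see this effect rather than read it off the printed array (a small sketch of my own, not part of the original answer), plotting the returned losses makes the initial bump and the subsequent drop obvious:

import matplotlib.pyplot as plt

plt.plot(losses[1:])  # skip the placeholder 0 that the losses array starts with
plt.xlabel('iteration')
plt.ylabel('ridge loss')
plt.show()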
In this paper, a very simple model is described to illustrate how the ant colony algorithm works. In short, it assumes two nodes which are connected via two links, one of which is shorter. Then, given a pheromone increment and a pheromone evaporation dynamic, one expects all ants to eventually pick the shorter path.
Now, I'm trying to replicate the simulation from this paper for the scenario above, whose result should look (more or less) like the plot below.
Here is my implementation (using the same specification as the test above).
import random
import matplotlib.pyplot as plt

N = 10
l1 = 1
l2 = 2
ru = 0.5
Q = 1
tau1 = 0.5
tau2 = 0.5
epochs = 150

success = [0 for x in range(epochs)]

def compute_probability(tau1, tau2):
    return tau1/(tau1 + tau2), tau2/(tau1 + tau2)

def select_path(prob1, prob2):
    if prob1 > prob2:
        return 1
    if prob1 < prob2:
        return 2
    if prob1 == prob2:
        return random.choice([1,2])

def update_accumulation(link_id):
    global tau1
    global tau2
    if link_id == 1:
        tau1 += Q / l1
        return tau1
    if link_id == 2:
        tau2 += Q / l2
        return tau2

def update_evapuration():
    global tau1
    global tau2
    tau1 *= (1-ru)
    tau2 *= (1-ru)
    return tau1, tau2

def report_results(success):
    plt.plot(success)
    plt.show()

for epoch in range(epochs-1):
    temp = 0
    for ant in range(N-1):
        prob1, prob2 = compute_probability(tau1, tau2)
        selected_path = select_path(prob1,prob2)
        if selected_path == 1:
            temp += 1
        update_accumulation(selected_path)
        update_evapuration()
    success[epoch] = temp

report_results(success)
However, what I get is fairly weird, as shown below.
It seems that my understanding of how the pheromone should be updated is flawed.
So, can one address what I am missing in this implementation?
Three problems in the proposed approach:
As @Mark mentioned in his comment, you need a weighted random choice (see the short note right after this list). Otherwise the proposed approach will likely always pick one of the paths, and the plot will be the straight line you show above. However, I think this was only part of the solution, because even with it you will likely still get a straight line due to early convergence, which leads to problem two.
Ant Colony Optimization is a metaheuristic that needs several (hyper)parameters configured to guide the search for a certain solution (e.g., tau from above or the number of ants). Fine-tuning these parameters is important, because you can converge early on a particular result (which is fine to some extent, if you want to use it as a heuristic). But the purpose of a metaheuristic is to provide some middle ground between exact and heuristic algorithms, which makes continuous exploration/exploitation an important part of its workings. This means the parameters need to be carefully optimised for your problem size/type.
Given that ACO uses a probabilistic approach to guide the search (and as the plot in the referenced paper shows), you will need to run the experiment several times and compute some statistic on those numbers. In my case below, I computed the average over 100 samples.
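(Side note, my addition rather than part of the original answer: the standard library already ships a weighted choice, so the weighted_random_choice/select_path pair in the code below could also be written with random.choices, available in Python 3.6+.)

import random

def select_path(prob1, prob2):
    # random.choices draws one sample with the given weights; [0] unwraps it
    return random.choices([1, 2], weights=[prob1, prob2])[0]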
import random
import matplotlib.pyplot as plt

N = 10
l1 = 1.1
l2 = 1.5
ru = 0.05
Q = 1
tau1 = 0.5
tau2 = 0.5
samples = 10
epochs = 150

success = [0 for x in range(epochs)]

def compute_probability(tau1, tau2):
    return tau1/(tau1 + tau2), tau2/(tau1 + tau2)

def weighted_random_choice(choices):
    max = sum(choices.values())
    pick = random.uniform(0, max)
    current = 0
    for key, value in choices.items():
        current += value
        if current > pick:
            return key

def select_path(prob1, prob2):
    choices = {1: prob1, 2: prob2}
    return weighted_random_choice(choices)

def update_accumulation(link_id):
    global tau1
    global tau2
    if link_id == 1:
        tau1 += Q / l1
    else:
        tau2 += Q / l2

def update_evaporation():
    global tau1
    global tau2
    tau1 *= (1-ru)
    tau2 *= (1-ru)

def report_results(success):
    plt.ylim(0.0, 1.0)
    plt.xlim(0, 150)
    plt.plot(success)
    plt.show()

for sample in range(samples):
    for epoch in range(epochs):
        temp = 0
        for ant in range(N):
            prob1, prob2 = compute_probability(tau1, tau2)
            selected_path = select_path(prob1, prob2)
            if selected_path == 1:
                temp += 1
            update_accumulation(selected_path)
            update_evaporation()
        ratio = ((temp + 0.0) / N)
        success[epoch] += ratio
    # reset pheromone values here to evaluate new sample
    tau1 = 0.5
    tau2 = 0.5

success = [x / samples for x in success]

for x in success:
    print(x)

report_results(success)
The code above should return something close to the desired plot.
I have implemented a univariate linear regression in python. The code is given below:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1,2,4,3,5,7,9,11])
y = np.array([3,5,9,7,11,15,19,23])
def hypothesis(w0,w1,x):
    return w0 + w1*x

def cost_cal(y,w0,w1,x,m):
    diff = hypothesis(w0,w1,x)-y
    diff_sqr = np.square(diff)
    total_cost = np.sum(diff)
    total_cost_sqr = (1/(2*m)) * np.sum(diff_sqr)
    return total_cost, total_cost_sqr

def gradient_descent(w0,w1,alpha,x,m,y):
    cost, cost_sqr = cost_cal(y,w0,w1,x,m)
    temp0 = (alpha/m) * cost
    temp1 = (alpha/m) * np.sum(cost*x)
    w0 = w0 - temp0
    w1 = w1 - temp1
    return w0,w1
These are my hypothesis, cost, and gradient_descent functions implemented in Python. When I use the initial weights w0 = 0 and w1 = 0, my minimized cost is 0.12589726000013188. But if I initialize w0 = -1 and w1 = -2, the minimized cost is 0.5035890400005265. What is the reason behind the different minimum costs for different initial weight values? As the error function (MSE) is a convex function, shouldn't it reach the global minimum? Am I doing something wrong?
w0=0
w1=0
alpha =0.0001
m = 8
z = 5000
c = np.zeros(z)
cs = np.zeros(z)
w0_arr=np.zeros(z)
w1_arr=np.zeros(z)
index = np.zeros(z)
i = 0
while (i<z):
    index[i] = i
    c[i],cs[i] = cost_cal(y,w0,w1,x,m)
    #print(i, c[i], cs[i])
    w0, w1 = gradient_descent(w0,w1,alpha,x,m,y)
    w0_arr[i],w1_arr[i] = w0,w1
    i=i+1

inc = np.argmin(cs)
print(inc)
print(cs[inc])
The answer may vary based on the initial vector you choose in weight space. Apart from the fact that the cost function is convex, the curve has many critical points, so it completely depends on the initial point or weights whether we end up in a local or the global minimum.
Image link: https://1.bp.blogspot.com/-ltxplazySpc/XQG4aprY2iI/AAAAAAAABVo/xAqLIln9OWkig5rq4AU2sBFuPBuxW5CFQCLcBGAs/w1200-h630-p-k-no-nu/local_vs_global_minima.PNG
As per the image at the given link, if you start from an initial point in the left corner you end up in the global minimum, whereas if you start from the right end you end up in a local minimum. The cost may vary by a huge amount, but in most cases the difference between a local and the global minimum is not very large, so if the cost varies by a big difference you need to cross-check. Picking initial weights randomly is good practice; they should not be set manually.
In the gradient_descent function, temp0 is assigned an array instead of a value; the sum of that array must be taken before adding.
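A minimal sketch of what that change could look like (my interpretation of the point above, reusing the hypothesis function from the question; the per-sample differences are summed explicitly inside gradient_descent instead of relying on the already-summed cost):

def gradient_descent(w0, w1, alpha, x, m, y):
    # per-sample prediction errors
    diff = hypothesis(w0, w1, x) - y
    # gradient of the cost (1/(2m)) * sum(diff^2) with respect to w0 and w1
    temp0 = (alpha/m) * np.sum(diff)
    temp1 = (alpha/m) * np.sum(diff * x)
    w0 = w0 - temp0
    w1 = w1 - temp1
    return w0, w1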