I'm trying to solve the CartPole-v1 problem from OpenAI by using backprop on a one-layer neural network - while updating the model at every time step using State action values (Q(s,a)). I'm unable to get the average reward to go up beyond about 42 steps per episode. Could anyone help? Is my approach even correct - as in, is it even possible for the agent to learn the optimal solution if I'm updating the Q-values every time-step, instead of batch updates every episode? Seems like theoretically it should be possible.
Details: After playing around and experimenting with activation functions, stochastic policies and finally settling on a deterministic policy with linear activation function and the parameters mentioned below - i'm able to get my agent to consistently converge (in about 100-300 steps) to an average reward of about 42 steps. But it doesn't go beyond 45. Adjusting the parameters (epsilon, discount_rate, and learning rate) in the program below does not have a huge impact on this.
I've tried looking for a similar solution online but none of them seem to fit the approach that I'm following. Almost all of the solutions involve learning at the end of each episode (by storing SARS' data).
Increasing the number of hidden layers doesn't help either. I also think it is unlikely that the algorithm will converge to a better value in future as I've run it for 10000+ episodes and it average reward is still around 40.
First, the hyperparameters:
epsilon = 0.5
lr = 0.05
discount_rate=0.9
# number of features in environment observations
num_inputs = 4
hidden_layer_nodes = 6
num_outputs = 2
The q function:
def calculateNNOutput(observation, m1, m2):
scaled_observation = scaleFeatures(observation)
hidden_layer = np.dot(scaled_observation, m1) # 1x4 X 4x6 -> 1x6
outputs = np.dot(hidden_layer, m2) # 1x6 X 6x2
return np.asmatrix(outputs) # 1x2
Action selection (policy):
def selectAction(observation):
#explore
global epsilon
if random.uniform(0,1) < epsilon:
return random.randint(0,1)
#exploit
outputs = calculateNNOutputs(observation)
print(outputs)
if (outputs[0,0] > outputs[0,1]):
return 0
else:
return 1
Backprop:
def backProp(prev_obs, m1, m2, experimental_values):
global lr
scaled_observation = np.asmatrix(scaleFeatures(prev_obs))
hidden_layer = np.asmatrix(np.dot(scaled_observation, m1)) #
outputs = np.asmatrix(np.dot(hidden_layer, m2)) # 1x6 X 6x2
delta_out = np.asmatrix((outputs-experimental_values)) # 1x2
delta_2=np.transpose(np.dot(m2,np.transpose(delta_out))) # 6x2 X 2x1 = 6x1_T = 1x6
GRADIENT_2 = (np.transpose(hidden_layer))*delta_out # 6x1 X 1x2 = 6x2 - same as w2
GRADIENT_1 = np.multiply(np.transpose(scaled_observation), delta_2) # 4 x 6 - same as w1
m1 = m1 - lr*GRADIENT_1
m2 = m2 - lr*GRADIENT_2
return m1, m2
Q-learning:
def updateWeights(prev_obs, action, obs, reward, done):
global weights_1, weights_2
calculated_value = calculateNNOutputs(prev_obs)
if done:
experimental_value = -1
else:
actionValues = calculateNNOutputs(obs) # 1x2
experimental_value = reward + discount_rate*(np.amax(actionValues, axis = 1)[0,0])
if action==0:
weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[experimental_value, calculated_value[0,1]]]))
else:
weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[calculated_value[0,0],experimental_value]]))
EDIT: the main loop -
record = 0
total = 0
for i_episode in range(num_episodes):
if (i_episode%10 == 0):
print("W1 = ", weights_1)
print("W2 = ", weights_2)
observation = env.reset()
epsilon = max(epsilon*0.9,0.01)
lr = max(lr*0.9, 0.01)
print("Average steps = ", total/(i_episode+1))
print("Record = ", record)
for t in range(1000):
action_taken = selectAction(observation)
print(action_taken)
previous_observation=observation
observation, reward, done, info = env.step(action_taken) # take the selected action
updateWeights(previous_observation, action_taken, observation,reward, done) # perform backprop to update the action value
if done:
total = total+t
if t > record:
record = t
print("Episode {} finished after {} timesteps".format(i_episode,t+1))
break
Do I need to make any changes in approach/implementation/parameter tuning?
Related
I'm trying to do my own neural network implementation from scratch, but I'm having some problems. Specifically, with the way one of the terms grows exponentially as iterations advance, which makes the accuracy quickly reach a plateau and gradient descent not find any optimal solution.
The equations I'm using I reached on my own, while studying some of the available resources online. I chose to write my own equations so I could understand them better:
... (1) where L is the last layer, and the hadamard product is being performed, and A is the values from the Lth layer after the activation function, Z is the values before the activation function is applied.
... (2) where W is the matrix of weights from the i+1th layer, with the number of rows is equal to the number of neurons in the i+1th layer and columns equals the number of neurones in the ith layer.
... (3)
... (4)
I struggled a lot with these, but I finally was able to make it work and keep all the correct dimensions of each matrix. Now, implementing these equations into my code:
def back_prop(self):
delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
for i, layer in reversed(list(enumerate(self.layers))):
dw = np.matmul(delta, self.a_record[i].T)
db = np.matmul(delta, np.ones((self.n, 1)))
self.dw_record.append(dw)
self.db_record.append(db)
if i > 0:
delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
a_record and z_record are lists that store the arrays corresponding to each layer. Same for dw and db records. I wrote it like this so it is as close as possible to my equations.
Problem comes once I start to measure performance on any dataset, which remain stagnant after just a couple of iterations. I get this warning, since it's not an error it doesn't stop the script.
RuntimeWarning: overflow encountered in multiply delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
When I analize the values that come from delta I find that the value of delta is growing way too fast, it starts with small values but grows up to e+200 magnitude and then it just returns infinity, which makes the next delta a bunch of np.nan and all the weights and biases to stop changing.
I'm fairly certain that the equations are equivalent to other equations I found in other neural networks, so I don't really know why this is happening. I don't imagine the values from any randomly generated dataset to be so insanely big that it overflows after just a couple of iterations.
I will post the whole class I made:
class Network():
def __init__(self, input, target, layers, alpha = 0.1, iter= 1000):
self.input = np.array(input).T
self.y_true = target
#self.target = np.array(one_hot(target)).T
self.target = np.array(target).T
self.layers = layers
self.alpha = alpha
self.iter = iter
self.n = len(input)
self.w = []
self.b = []
for i, layer in enumerate(layers):
if i == 0:
self.w.append(np.random.rand(layer.neurons, len(self.input)) - 0.5)
else:
self.w.append(np.random.rand(layer.neurons, layers[i - 1].neurons) - 0.5)
self.b.append(np.random.rand(layer.neurons, 1) - 0.5)
self.a_record = []
self.z_record = []
self.dw_record = []
self.db_record = []
def forward_prop(self):
a = self.input
self.a_record.append(a)
for i, layer in enumerate(self.layers):
z = np.matmul(self.w[i], a) + self.b[i].reshape(-1, 1)
a = layer.func(z)
self.z_record.append(z)
self.a_record.append(a)
self.output = a
def back_prop(self):
delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
print(delta)
for i, layer in reversed(list(enumerate(self.layers))):
dw = np.matmul(delta, self.a_record[i].T)
db = np.matmul(delta, np.ones((self.n, 1)))
self.dw_record.append(dw)
self.db_record.append(db)
print(self.a_record[i].T)
print("------------------------------------------")
if i > 0:
delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
def gradient_desc(self):
w_copy = self.w
for i, (w, dw) in enumerate(zip(w_copy, self.dw_record[::-1])):
self.w[i] = w - self.alpha*dw
b_copy = self.b
for i, (b, db) in enumerate(zip(b_copy, self.db_record[::-1])):
self.b[i] = b - self.alpha*db
def fit(self):
for i in range(self.iter):
self.forward_prop()
self.back_prop()
self.gradient_desc()
if i % 40 == 0:
print(f"Iteration: {i + 1} / {self.iter} =====================================")
print(f"Accuracy: {get_accuracy(self.output, self.y_true)}")
I've been trying for a couple days now to figure this out, but I don't seem to find the error or maybe a workaround to prevent overflow. Maybe the equations are wrong from the start? Please any help would be greatly appreciated, thank you in advance.
I have the following code:
class k_armed_bandit:
def __init__(self, epsilon, k, steps):
self.batches = 4000
self.epsilon = epsilon
self.k = k
self.steps = steps
self.mean = 4 * torch.rand((self.batches, self.k), device=device) - 2
self.var = torch.rand((self.batches, self.k), device=device) + 0.5
self.estimates = 4 * torch.ones((self.batches, self.k), device=device)
self.counts = torch.ones((self.batches, self.k), device=device)
def run(self):
rewards = torch.zeros(self.steps, device=device)
for i in range(self.steps):
pos = torch.where(
torch.rand(self.batches, device=device) < self.epsilon,
torch.randint(0, k, size=(self.batches, ), device=device),
torch.argmax(self.estimates, dim=1)
)
pos_mask = F.one_hot(pos, num_classes=self.k)
val = torch.normal(mean=torch.sum(pos_mask * self.mean, dim=1), std=torch.sum(pos_mask * self.var, dim=1))
self.counts += pos_mask
val = val[:, None] * pos_mask
self.estimates += ((val - (self.estimates * pos_mask)) / self.counts) * pos_mask
rewards[i] = torch.sum(val) / self.batches
return rewards
If you're not familiar with the problem, the idea is that you have k options in front of you, and you are trying to maximize the sum of rewards received from the options across some number of turns. The reward from option i is sampled from N(mean[i], var[i]). You keep a running track of returns from each option (stored in self.estimates, which is updated in streaming fashion). With probability epsilon you pick a random option, and with probability 1 - epsilon you simply pick the option with the best return (the position selected is stored in pos). At the end, you will return the sequence of rewards you obtained from the options you selected.
Anyways, the code runs pretty fast, but I suspect that some of my naive mistakes make this slower than it should be. First of all, my torch.where requires that for every agent in the batch, both a random position and the argmax should be computed, even though only one is used. This can't be good. Next, pos_mask is a big problem. It allows me to update the averages each agent keeps track of, only if they were picked as the sampled position. It also allows me to update the number of times each option was selected. The issue is that pos_mask, and all related computations, are much larger than they need to be, as it a mask and I cannot seem to update the relevant indices in parallel. I tried to fiddle with torch's index related methods, but none of them seem to help.
Any clues how to speed this up? Or maybe this problem is just not nice to implement on GPU?
I have a given dataset in the matrix y and I want to train different SOMs with it. The SOM is one-dimensional (a line), and its number of neurons varies. I train a SOM of size N=2 at first, and N=NMax at last, giving a total of NMax-2+1 SOMs. For each SOM, I want to store the weights once the training is over before moving on to the next SOM.
The whole point of using PyOpenCL here is that each one of the outer loops is independent of the others. Namely, for each value of N, the script doesn't care about what happens when N takes other values. One could have the same result running the script NMax-2+1 times changing the value of N manually.
With this in mind, I was hoping to be able to perform each one of these independent iterations at the same time using the GPU, so that the time spent reduces significantly. The increase in speed will be less than 1/(NMax-2+1) though, because each iteration is more expensive that the previous ones, as for larger values of N, more calculations are made.
Is there a way to 'translate' this code to run on the GPU? I've never used OpenCL before, so let me know if this is too broad or silly so I can ask a more specific question. The code is self-contained, so feel free to try it out.The four constants declared at the beginning can be changed to whatever you like (given that NMax > 1 and all the others are strictly positive).
import numpy as np
import time
m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 3 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N
t_begin = time.clock() # Start time
for N in range(NMax-1): # Number of neurons for this iteration
w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
iterCount = 1
while iterCount < iterMax:
# Mix up the input patterns
mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
# Sigma reduction
sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
s2 = 2*sigma**2
# Learning rate reduction
eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
for selectedInput in mixInputs: # Pick up one pattern
# Search winning neuron
aux = np.sum((selectedInput - w[N])**2, axis = -1)
ii = np.argmin(aux) # Neuron 'ii' is the winner
jjs = abs(ii - list(range(N+2)))
dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
# Update weights
w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
print(N+2,iterCount)
iterCount += 1
# Assign each datapoint to its nearest neuron
for kk in range(np.size(y,axis = 0)):
aux = np.sum((y[kk,] - w[N])**2,axis=-1)
ii = np.argmin(aux) # Neuron 'ii' is the winner
wClusters[kk,N] = ii + 1
t_end = time.clock() # End time
#%%
print(t_end - t_begin)
I'm trying to give a somewhat complete answer.
First of all:
Can this code be adapted to be run on the GPU using (py)OpenCL?
Most probably yes.
Can this been done automatically?
No (afaik).
Most of the questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speedup gain?" You are stating, that your outer loop is independent on the results of other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL working element would execute the same code with slightly different input parameters. Not regarding overhead by data transfer between host and device, the running time of this approach would be equal to the running time of the slowest iteration. Depending on the iterations in your outer loop, this could be a massive speed gain. As long as the numbers stay relatively small, you could try the multiprocessing module in python to parallelize these iterations on the CPU instead of the GPU.
Porting to the GPU usually only makes sense, if a huge number of processes are to be run in parallel (about 1000 or more). So in your case, if you really want an enormous speed boost, see if you can parallelize all calculations inside the loop. For example, you have 150 iterations and 2000 data points. If you could somehow parallelize these 2000 data points, this could offer a much bigger speed gain, which could justify the work of porting the whole code to OpenCL.
TL;DR:
Try parallelizing on CPU first. If you find the need to run more than several 100s of processes at the same time, move to GPU.
Update: Simple code for parallelizing on CPU using multiprocessing (without callback)
import numpy as np
import time
import multiprocessing as mp
m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N
def neuron_run(N):
w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
iterCount = 1
while iterCount < iterMax:
# Mix up the input patterns
mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
# Sigma reduction
sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
s2 = 2*sigma**2
# Learning rate reduction
eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
for selectedInput in mixInputs: # Pick up one pattern
# Search winning neuron
aux = np.sum((selectedInput - w[N])**2, axis = -1)
ii = np.argmin(aux) # Neuron 'ii' is the winner
jjs = abs(ii - list(range(N+2)))
dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
# Update weights
w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
print(N+2,iterCount)
iterCount += 1
# Assign each datapoint to its nearest neuron
for kk in range(np.size(y,axis = 0)):
aux = np.sum((y[kk,] - w[N])**2,axis=-1)
ii = np.argmin(aux) # Neuron 'ii' is the winner
wClusters[kk,N] = ii + 1
t_begin = time.clock() # Start time
#%%
def apply_async():
pool = mp.Pool(processes=NMax)
for N in range(NMax-1):
pool.apply_async(neuron_run, args = (N,))
pool.close()
pool.join()
print "Multiprocessing done!"
if __name__ == '__main__':
apply_async()
t_end = time.clock() # End time
print(t_end - t_begin)
First of all I wanna say that I am a python beginner and also completely new to neural networks. When I read about it I was very excited and thought I set up a little code from scratch (see code below).
But somehow my code is not working properly. I guess there are some major bugs (in the algorithm and the programming?). But I cannot find them at the moment.
So, in the handwritten notes you can see my system (and some formulas). I wanna solve a decision problem where I have data in the form of X=(x1,x2) and y (which is 0 or 1).
My network has one hidden layer consisting of 3 neurons and one output layer.
As an activation function I use sigmoid and for the loss I use cross entropy (sth like log likelihood for bernoulli, I guess?)
The neurons take the weighted input W.X + bias and return a scalar between 0,1.
For the learning process I tried to use backward propagation. So I just computed the derivative dLoss/dparams and applied the chain rule several times. In order not to make everything in index notation I tried to use numpy to handle matrices, etc.
Maybe someone sees directly the things I did wrong? (apart from the bad programming :D)
Handwritten notes 1/2
Handwritten notes 2/2
#!/usr/bin/python
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
## create random data set for decision problem
np.random.seed(0) #fixed seed to reproduce results
X, y = datasets.make_moons(20, noise=0.20) # lists containing the Data
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Spectral) # plot it
plt.show() # show plot; proceeds when plot is closed
## initialize model parameters
W1 = np.random.uniform(-0.5,0.5,[3,2]) # hidden layer weights (3 x 2) matrix
b1 = np.random.uniform(-1,1,[3]) # bias for neurons in hidden layer
W2 = np.random.uniform(-0.5,0.5,[1,3]) # weights for output layer (1 x 3)
b2 = np.random.uniform(-1,1,[1]) # bias for output neuron
# collecting parameters in model dict
model = {"W1" : W1, "W2" : W2, "b1" : b1, "b2" : b2}
## the activation function
# can also return the derivative
def sigmoid(x,derivative = False):
if derivative == True:
# derivative; np.multiply multiplies element-wise
# needed if x is tensor-like object
return np.multiply(sigmoid(x), (1 - sigmoid(x)))
else:
return 1.0/(1.0 + np.exp(-x))
## moving forward in the network for a single data point
# and returns a dict with necessary information
def move_forward(model, DataX):
W1 = model["W1"] # extract model params from dict to make it better readable
W2 = model["W2"]
b1 = model["b1"]
b2 = model["b2"]
t1 = np.dot(W1,DataX) + b1 # weighted input for hidden layer (here 3-dim object)
phi = sigmoid(t1) # evaluate activation function
phiP = sigmoid(t1, True) # derivative (needed for moving backward "learning")
t2 = np.dot(W2,phi) + b2 # weighted input for output layer (1-dim object)
sig = sigmoid(t2) # evaluate final output
sigP = sigmoid(t2, True) # derivative
forward = {"phi" : phi,"phiP" : phiP, # dict collecting the output
"sig" : sig, "sigP" : sigP}
return forward
## moving backward for a single data point
def move_backward(forward, model, DataX):
W1 = model["W1"]
W2 = model["W2"]
b1 = model["b1"]
b2 = model["b2"]
phi = forward["phi"]
phiP = forward["phiP"]
sig = forward["sig"]
sigP = forward["sigP"]
#not the full deltaWs / deltabs; multiplied by the rest in "update_model"
dW2 = sigP * phi # part from "derivative chain" roughly: dsig/dt2 dt2 / dW2
db2 = sigP # analogue
temp = np.multiply(W2,phiP) # multiplied element wise
dW1 = sigP * np.outer(temp, DataX) # outer product since: (W2 * phi)_j x_i
db1 = sigP * np.outer(temp, [1]) # analogue
backward = {"dW1": dW1, "dW2": dW2, "db1": db1, "db2": db2}
return backward
## part of the loss function; here for one data point
# returns also the derivative for the learning process
def loss(DataY, PredictionY, derivative = False):
if derivative == True:
return DataY / PredictionY - (1.0 - DataY) / (1.0 - PredictionY)
log_likelihood = DataY * np.log(PredictionY) + (1.0 - DataY) * np.log(1.0 - PredictionY)
return log_likelihood
## updating model parameters
## epsilon is a small parameter regulating the learning
def update_model(DataSet,model, epsilon):
DataX = DataSet[0]
DataY = DataSet[1]
total_loss = 0
dW1_total = 0
dW2_total = 0
db1_total = 0
db2_total = 0
beta = 0
W1 = model["W1"]
W2 = model["W2"]
b1 = model["b1"]
b2 = model["b2"]
# iterating over full data set
for i in range(len(DataX)):
forward = move_forward(model, DataX[i])
backward = move_backward(forward, model, DataX[i])
sig = forward["sig"]
total_loss += loss(DataY[i],sig)
beta += loss(DataY[i],sig, True)
dW1_total += backward["dW1"]
dW2_total += backward["dW2"]
db1_total += backward["db1"]
db2_total += backward["db2"]
total_loss *= -1.0/len(DataX) # the total loss
beta *= -1.0/len(DataX) # the derivative of dloss/dsig
## setting updated model params
W1_new = W1 - epsilon * beta * dW1_total
W2_new = W2 - epsilon * beta * dW2_total
b1_new = b1 - epsilon * beta * np.squeeze(np.asarray(db1_total))
b2_new = b2 - epsilon * beta * db2_total
model_updated = {"W1": W1_new, "W2": W2_new, "b1": b1_new,
"b2": b2_new, "loss": total_loss}
return model_updated
## train the model with a given data set N times
def train_model(DataSet,model, epsilon, N, print_state = False):
for i in range(N):
model = update_model(DataSet,model, epsilon)
if print_state == True:
if i % 100 == 0:
print(model)
print("loss = " , model["loss"])
print(model)
return model
## call the training function and store the output
model_new = train_model([X,y],model, 0.01, 1000, True)
## check result with data point in the training set
move_forward(model_new,X[0])
# Note: Hm, somehow I always get sig = 0.5 (roughly). And the loss
# does not get smaller than 0.68
# I guess there must be several mistakes
I am trying to implement gradient descent on a dataset. Even though I tried everything, I couldn't make it work. So, I created a test case. I am trying my code on a random data and try to debug.
More specifically, what I am doing is, I am generating random vectors between 0-1 and random labels for these vectors. And try to over-fit the training data.
However, my weight vector gets bigger and bigger in each iteration. And then, I have infinities. So, I do not actually learn anything. Here is my code:
import numpy as np
import random
def getRandomVector(n):
return np.random.uniform(0,1,n)
def getVectors(m, n):
return [getRandomVector(n) for i in range(n)]
def getLabels(n):
return [random.choice([-1,1]) for i in range(n)]
def GDLearn(vectors, labels):
maxIterations = 100
stepSize = 0.01
w = np.zeros(len(vectors[0])+1)
for i in range(maxIterations):
deltaw = np.zeros(len(vectors[0])+1)
for i in range(len(vectors)):
temp = np.append(vectors[i], -1)
deltaw += ( labels[i] - np.dot(w, temp) ) * temp
w = w + ( stepSize * (-1 * deltaw) )
return w
vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print w
I am using LMS for loss function. So, in all iterations, my update is the following,
where w^i is the ith weight vector and R is the stepSize and E(w^i) is the loss function.
Here is my loss function. (LMS)
and here is how I derivated the loss function,
,
Now, my questions are:
Should I expect good results in this random scenario using Gradient Descent? (What is the theoretical bounds?)
If yes, what is my bug in my implementation?
PS: I tried several other maxIterations and stepSize parameters. Still not working.
PS2: This is the best way I can ask the question here. Sorry if the question is too specific. But it made me crazy. I really want to learn the problem.
Your code has a couple of faults:
In GetVectors() method, you did not actually use the input variable m;
In GDLearn() method, you have a double loop, but you use the same variable i as the loop variables in both loops. (I guess the logic is still right, but it's confusing).
The prediction error (labels[i] - np.dot(w, temp)) has the wrong sign.
Step size does matters. If I am using 0.01 as step size, the cost is increasing in each iteration. Changing it to be 0.001 solved the problem.
Here is my revised code based on your original code.
import numpy as np
import random
def getRandomVector(n):
return np.random.uniform(0,1,n)
def getVectors(m, n):
return [getRandomVector(n) for i in range(m)]
def getLabels(n):
return [random.choice([-1,1]) for i in range(n)]
def GDLearn(vectors, labels):
maxIterations = 100
stepSize = 0.001
w = np.zeros(len(vectors[0])+1)
for iter in range(maxIterations):
cost = 0
deltaw = np.zeros(len(vectors[0])+1)
for i in range(len(vectors)):
temp = np.append(vectors[i], -1)
prediction_error = np.dot(w, temp) - labels[i]
deltaw += prediction_error * temp
cost += prediction_error**2
w = w - stepSize * deltaw
print 'cost at', iter, '=', cost
return w
vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print w
Running result -- you can see the cost is decreasing with each iteration but with a diminishing return.
cost at 0 = 100.0
cost at 1 = 99.4114482617
cost at 2 = 98.8476022685
cost at 3 = 98.2977744556
cost at 4 = 97.7612851154
cost at 5 = 97.2377571222
cost at 6 = 96.7268325883
cost at 7 = 96.2281642899
cost at 8 = 95.7414151147
cost at 9 = 95.2662577529
cost at 10 = 94.8023744037
......
cost at 90 = 77.367904046
cost at 91 = 77.2744249433
cost at 92 = 77.1823702888
cost at 93 = 77.0917090883
cost at 94 = 77.0024111475
cost at 95 = 76.9144470493
cost at 96 = 76.8277881325
cost at 97 = 76.7424064707
cost at 98 = 76.6582748518
cost at 99 = 76.5753667579
[ 0.16232142 -0.2425511 0.35740632 0.22548442 0.03963853 0.19595213
0.20080207 -0.3921798 -0.0238925 0.13097533 -0.1148932 -0.10077534
0.00307595 -0.30111942 -0.17924479 -0.03838637 -0.23938181 0.1384443
0.22929163 -0.0132466 0.03325976 -0.31489526 0.17468025 0.01351012
-0.25926117 0.09444201 0.07637793 -0.05940019 0.20961315 0.08491858
0.07438357]