I have the following code:
import torch
import torch.nn.functional as F

# device is defined earlier in my script, e.g. torch.device("cuda")

class k_armed_bandit:
    def __init__(self, epsilon, k, steps):
        self.batches = 4000
        self.epsilon = epsilon
        self.k = k
        self.steps = steps
        self.mean = 4 * torch.rand((self.batches, self.k), device=device) - 2
        self.var = torch.rand((self.batches, self.k), device=device) + 0.5
        self.estimates = 4 * torch.ones((self.batches, self.k), device=device)
        self.counts = torch.ones((self.batches, self.k), device=device)

    def run(self):
        rewards = torch.zeros(self.steps, device=device)
        for i in range(self.steps):
            pos = torch.where(
                torch.rand(self.batches, device=device) < self.epsilon,
                torch.randint(0, self.k, size=(self.batches, ), device=device),
                torch.argmax(self.estimates, dim=1)
            )
            pos_mask = F.one_hot(pos, num_classes=self.k)
            val = torch.normal(mean=torch.sum(pos_mask * self.mean, dim=1), std=torch.sum(pos_mask * self.var, dim=1))
            self.counts += pos_mask
            val = val[:, None] * pos_mask
            self.estimates += ((val - (self.estimates * pos_mask)) / self.counts) * pos_mask
            rewards[i] = torch.sum(val) / self.batches
        return rewards
If you're not familiar with the problem, the idea is that you have k options in front of you and you are trying to maximize the sum of rewards received from them over some number of turns. The reward from option i is sampled from N(mean[i], var[i]). You keep a running average of the rewards received from each option (stored in self.estimates, updated in streaming fashion). With probability epsilon you pick a random option, and with probability 1 - epsilon you simply pick the option with the best estimated reward (the selected position is stored in pos). At the end, run returns the sequence of (batch-averaged) rewards obtained at each step.
Anyway, the code runs pretty fast, but I suspect some of my naive mistakes make it slower than it should be. First, my torch.where requires that, for every agent in the batch, both a random position and the argmax be computed, even though only one is used. That can't be good. Next, pos_mask is a big problem. It lets me update each agent's running average only at the option that was actually selected, and it also lets me update the count of how many times each option was chosen. The issue is that pos_mask, and all the computations involving it, are much larger than they need to be, since it is a full one-hot mask, and I can't seem to update just the relevant indices in parallel. I tried to fiddle with torch's index-related methods, but none of them seem to help.
Any clues how to speed this up? Or maybe this problem is just not nice to implement on GPU?
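Edit: to make it concrete, this is roughly the kind of index-based update I was hoping to replace the one-hot mask with (an untested sketch of the loop body using advanced indexing; I'm not sure it is correct or any faster):

# body of the loop in run(), rewritten with advanced indexing (sketch)
explore = torch.rand(self.batches, device=device) < self.epsilon
pos = torch.where(explore,
                  torch.randint(0, self.k, (self.batches,), device=device),
                  torch.argmax(self.estimates, dim=1))
batch = torch.arange(self.batches, device=device)
# gather the mean/std of the chosen arm for every agent
val = torch.normal(mean=self.mean[batch, pos], std=self.var[batch, pos])
# update counts and running averages only at the selected (batch, pos) entries
self.counts[batch, pos] += 1
self.estimates[batch, pos] += (val - self.estimates[batch, pos]) / self.counts[batch, pos]
rewards[i] = val.mean()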
I'm trying to do my own neural network implementation from scratch, but I'm having some problems. Specifically, one of the terms grows exponentially as the iterations advance, which makes the accuracy quickly reach a plateau, so gradient descent never finds a good solution.
I derived the equations I'm using on my own while studying some of the resources available online; I chose to write them out myself so I could understand them better:
\delta^L = (A^L - Y) \odot g'(Z^L) \qquad (1)
where L is the last layer, Y is the target output, \odot is the Hadamard product, A^L is the values of the Lth layer after the activation function, and Z^L is the values before the activation function is applied.
\delta^i = \left( (W^{i+1})^T \delta^{i+1} \right) \odot g'(Z^i) \qquad (2)
where W^{i+1} is the matrix of weights of the (i+1)th layer, with the number of rows equal to the number of neurons in the (i+1)th layer and the number of columns equal to the number of neurons in the ith layer.
\frac{\partial C}{\partial W^i} = \delta^i \, (A^{i-1})^T \qquad (3)
\frac{\partial C}{\partial b^i} = \delta^i \, \mathbf{1} \qquad (4)
I struggled a lot with these, but I was finally able to make them work and keep the dimensions of every matrix consistent. Now, implementing these equations in my code:
def back_prop(self):
    delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
    for i, layer in reversed(list(enumerate(self.layers))):
        dw = np.matmul(delta, self.a_record[i].T)
        db = np.matmul(delta, np.ones((self.n, 1)))
        self.dw_record.append(dw)
        self.db_record.append(db)
        if i > 0:
            delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
a_record and z_record are lists that store the arrays corresponding to each layer. Same for dw and db records. I wrote it like this so it is as close as possible to my equations.
The problem comes once I start to measure performance on any dataset: it remains stagnant after just a couple of iterations. I also get this warning, and since it's not an error it doesn't stop the script:
RuntimeWarning: overflow encountered in multiply
  delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
When I analyze the values of delta, I find that it grows way too fast: it starts small but reaches magnitudes around 1e+200 and then overflows to infinity, which turns the next delta into a bunch of np.nan and makes all the weights and biases stop changing.
I'm fairly certain the equations are equivalent to the ones I found for other neural networks, so I don't really know why this is happening. I can't imagine that the values of any randomly generated dataset are so insanely big that they overflow after just a couple of iterations.
I will post the whole class I made:
class Network():
    def __init__(self, input, target, layers, alpha = 0.1, iter= 1000):
        self.input = np.array(input).T
        self.y_true = target
        #self.target = np.array(one_hot(target)).T
        self.target = np.array(target).T
        self.layers = layers
        self.alpha = alpha
        self.iter = iter
        self.n = len(input)
        self.w = []
        self.b = []
        for i, layer in enumerate(layers):
            if i == 0:
                self.w.append(np.random.rand(layer.neurons, len(self.input)) - 0.5)
            else:
                self.w.append(np.random.rand(layer.neurons, layers[i - 1].neurons) - 0.5)
            self.b.append(np.random.rand(layer.neurons, 1) - 0.5)
        self.a_record = []
        self.z_record = []
        self.dw_record = []
        self.db_record = []

    def forward_prop(self):
        a = self.input
        self.a_record.append(a)
        for i, layer in enumerate(self.layers):
            z = np.matmul(self.w[i], a) + self.b[i].reshape(-1, 1)
            a = layer.func(z)
            self.z_record.append(z)
            self.a_record.append(a)
        self.output = a

    def back_prop(self):
        delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
        print(delta)
        for i, layer in reversed(list(enumerate(self.layers))):
            dw = np.matmul(delta, self.a_record[i].T)
            db = np.matmul(delta, np.ones((self.n, 1)))
            self.dw_record.append(dw)
            self.db_record.append(db)
            print(self.a_record[i].T)
            print("------------------------------------------")
            if i > 0:
                delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))

    def gradient_desc(self):
        w_copy = self.w
        for i, (w, dw) in enumerate(zip(w_copy, self.dw_record[::-1])):
            self.w[i] = w - self.alpha*dw
        b_copy = self.b
        for i, (b, db) in enumerate(zip(b_copy, self.db_record[::-1])):
            self.b[i] = b - self.alpha*db

    def fit(self):
        for i in range(self.iter):
            self.forward_prop()
            self.back_prop()
            self.gradient_desc()
            if i % 40 == 0:
                print(f"Iteration: {i + 1} / {self.iter} =====================================")
                print(f"Accuracy: {get_accuracy(self.output, self.y_true)}")
I've been trying for a couple of days now to figure this out, but I can't find the error, or even a workaround to prevent the overflow. Maybe the equations are wrong from the start? Any help would be greatly appreciated, thank you in advance.
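For reference, the layer objects passed to Network only need a neurons attribute plus func/prime methods (the activation and its derivative). A simplified hypothetical stand-in, not my actual layer class, would look like this:

class ReLULayer():
    # minimal hypothetical layer: only what the Network class above actually uses
    def __init__(self, neurons):
        self.neurons = neurons
    def func(self, z):             # activation
        return np.maximum(z, 0)
    def prime(self, z):            # derivative of the activation
        return (z > 0).astype(float)

# usage sketch (shapes only, not real data):
# net = Network(input=X_train, target=Y_train, layers=[ReLULayer(16), ReLULayer(10)])
# net.fit()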
I have been trying to make my own neural network from scratch, following a tutorial that shows how to do this. After some time I got it working, but I have run into a problem I cannot solve: the way my network updates the weights and biases. I know that gradient descent won't always decrease the loss, and for a few epochs it might even increase a bit, but it should still trend downward and work much better than mine does. Sometimes the whole process gets stuck at a loss of around 9 or 13 and cannot get out of it. I have checked many tutorials, videos and websites, but I couldn't find anything wrong in my code.
self.activate, self.dactivate, self.loss and self.dloss:
# sigmoid
self.activate = lambda x: np.divide(1, 1 + np.exp(-x))
self.dactivate = lambda x: np.multiply(self.activate(x), (1 - self.activate(x)))
# relu
self.activate = lambda x: np.where(x > 0, x, 0)
self.dactivate = lambda x: np.where(x > 0, 1, 0)
# loss I use (cross-entropy)
clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10) # it's used to squeeze x into a probability between 0 and 1 (which I think is required)
self.loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x))))/y.shape[0])
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
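For reference, when the output layer is a sigmoid, multiplying self.dloss by the sigmoid derivative should algebraically simplify to x - y, which gives a quick way to sanity-check the two lambdas (a small standalone check, not part of the original code):

import numpy as np

# check: for a sigmoid output, dloss(a, y) * sigmoid'(z) == a - y
z = np.array([[0.3, -1.2, 2.0]])
y = np.array([[1.0, 0.0, 1.0]])
a = 1 / (1 + np.exp(-z))
combined = -(y / a - (1 - y) / (1 - a)) * (a * (1 - a))
print(np.allclose(combined, a - y))   # True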
The code I use for forwardpropagation:
self.activate(np.dot(X, self.weights) + self.biases) # it's an example for first hidden layer
And that's the code for backpropagation:
First part, in DenseNeuralNetwork class:
last_derivative = self.dloss(output, y)
for layer in reversed(self.layers):
last_derivative = layer.backward(last_derivative, self.lr)
And the second part, in Dense class:
def backward(self, last_derivative, lr):
    w = self.weights
    dfunction = self.dactivate(last_derivative)
    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)
    return np.dot(dfunction, w.T)
I have also made a repl so you can check the whole code and run it without any problems.
1.
line 12
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
If you're going to clip x, you should clip y too. There are other ways to implement this, but if you are going to do it this way,
change to
self.dloss = lambda x, y: -(np.divide(clip(y), clip(x)) - np.divide(1 - clip(y), 1 - clip(x)))
2.
line 75
dfunction = self.dactivate(last_derivative)
this back propagation part is just wrong.
change to
dfunction = last_derivative*self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)
3.
line 77
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
last_derivative should be dfunction. I think this is just a mistake.
change to
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)
4.
line 85
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
Not sure where you are going with this, but I think the initialized values are too big. We're not doing precise hyperparameter tuning here, so I just made them small.
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
All good now
After this I changed the learning rate to 0.01 because training was too slow, and it worked fine.
I think you are misunderstanding back propagation. You should probably double check how it works. The other parts are ok I think.
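Putting fixes 2 and 3 together, the corrected backward method would look roughly like this (a sketch, assuming the rest of the Dense class stays as it is):

def backward(self, last_derivative, lr):
    w = self.weights
    # fix 2: multiply the incoming gradient by the activation derivative of the pre-activation
    z = np.dot(self.layer_input, self.weights) + self.biases
    dfunction = last_derivative * self.dactivate(z)
    d_w = np.dot(self.layer_input.T, dfunction) * (1. / self.layer_input.shape[1])
    # fix 3: the bias gradient should be built from dfunction, not last_derivative
    d_b = (1. / self.layer_input.shape[1]) * np.dot(
        np.ones((self.biases.shape[0], dfunction.shape[0])), dfunction)
    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)
    return np.dot(dfunction, w.T)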
This can be caused by your training data: either it is too small or it has too many diverse labels (that's what I gather from the code at the link you shared).
I re-ran your code several times and it produced different training performance each time. Sometimes the loss keeps decreasing until the last epoch, sometimes it keeps increasing, and once it decreased to some point and then started increasing again (with a minimum achieved loss of 0.5).
I think it is your training data that matters this time. The learning rate is good enough, though (assuming you did the calculations for the linear combination, backpropagation, etc. correctly).
I'm trying to solve the CartPole-v1 problem from OpenAI by using backprop on a one-layer neural network - while updating the model at every time step using State action values (Q(s,a)). I'm unable to get the average reward to go up beyond about 42 steps per episode. Could anyone help? Is my approach even correct - as in, is it even possible for the agent to learn the optimal solution if I'm updating the Q-values every time-step, instead of batch updates every episode? Seems like theoretically it should be possible.
Details: After playing around and experimenting with activation functions, stochastic policies, and finally settling on a deterministic policy with a linear activation function and the parameters mentioned below, I'm able to get my agent to consistently converge (in about 100-300 steps) to an average reward of about 42 steps. But it doesn't go beyond 45. Adjusting the parameters (epsilon, discount_rate, and learning rate) in the program below does not have a huge impact on this.
I've tried looking for a similar solution online but none of them seem to fit the approach that I'm following. Almost all of the solutions involve learning at the end of each episode (by storing SARS' data).
Increasing the number of hidden layers doesn't help either. I also think it is unlikely that the algorithm will converge to a better value in the future, as I've run it for 10000+ episodes and its average reward is still around 40.
First, the hyperparameters:
epsilon = 0.5
lr = 0.05
discount_rate=0.9
# number of features in environment observations
num_inputs = 4
hidden_layer_nodes = 6
num_outputs = 2
The q function:
def calculateNNOutput(observation, m1, m2):
    scaled_observation = scaleFeatures(observation)
    hidden_layer = np.dot(scaled_observation, m1) # 1x4 X 4x6 -> 1x6
    outputs = np.dot(hidden_layer, m2) # 1x6 X 6x2
    return np.asmatrix(outputs) # 1x2
Action selection (policy):
def selectAction(observation):
    # explore
    global epsilon
    if random.uniform(0,1) < epsilon:
        return random.randint(0,1)
    # exploit
    outputs = calculateNNOutputs(observation)
    print(outputs)
    if (outputs[0,0] > outputs[0,1]):
        return 0
    else:
        return 1
Backprop:
def backProp(prev_obs, m1, m2, experimental_values):
    global lr
    scaled_observation = np.asmatrix(scaleFeatures(prev_obs))
    hidden_layer = np.asmatrix(np.dot(scaled_observation, m1))
    outputs = np.asmatrix(np.dot(hidden_layer, m2)) # 1x6 X 6x2
    delta_out = np.asmatrix((outputs-experimental_values)) # 1x2
    delta_2 = np.transpose(np.dot(m2,np.transpose(delta_out))) # 6x2 X 2x1 = 6x1_T = 1x6
    GRADIENT_2 = (np.transpose(hidden_layer))*delta_out # 6x1 X 1x2 = 6x2 - same as w2
    GRADIENT_1 = np.multiply(np.transpose(scaled_observation), delta_2) # 4 x 6 - same as w1
    m1 = m1 - lr*GRADIENT_1
    m2 = m2 - lr*GRADIENT_2
    return m1, m2
Q-learning:
def updateWeights(prev_obs, action, obs, reward, done):
    global weights_1, weights_2
    calculated_value = calculateNNOutputs(prev_obs)
    if done:
        experimental_value = -1
    else:
        actionValues = calculateNNOutputs(obs) # 1x2
        experimental_value = reward + discount_rate*(np.amax(actionValues, axis = 1)[0,0])
    if action == 0:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[experimental_value, calculated_value[0,1]]]))
    else:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[calculated_value[0,0], experimental_value]]))
EDIT: the main loop -
record = 0
total = 0
for i_episode in range(num_episodes):
    if (i_episode % 10 == 0):
        print("W1 = ", weights_1)
        print("W2 = ", weights_2)
    observation = env.reset()
    epsilon = max(epsilon*0.9, 0.01)
    lr = max(lr*0.9, 0.01)
    print("Average steps = ", total/(i_episode+1))
    print("Record = ", record)
    for t in range(1000):
        action_taken = selectAction(observation)
        print(action_taken)
        previous_observation = observation
        observation, reward, done, info = env.step(action_taken) # take the selected action
        updateWeights(previous_observation, action_taken, observation, reward, done) # perform backprop to update the action value
        if done:
            total = total + t
            if t > record:
                record = t
            print("Episode {} finished after {} timesteps".format(i_episode, t+1))
            break
Do I need to make any changes in approach/implementation/parameter tuning?
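Edit 2: for completeness, the snippets above reference env, weights_1, weights_2, scaleFeatures and num_episodes without showing them. A minimal hypothetical setup (not my actual code, but matching the shapes the comments above assume) would be:

import gym
import numpy as np
import random

env = gym.make("CartPole-v1")
num_episodes = 2000

# hypothetical weight initialization: m1 is 4x6 and m2 is 6x2, as the comments above assume
weights_1 = np.random.randn(num_inputs, hidden_layer_nodes) * 0.1
weights_2 = np.random.randn(hidden_layer_nodes, num_outputs) * 0.1

def scaleFeatures(observation):
    # hypothetical stand-in: squash each observation component to roughly [-1, 1]
    scale = np.array([2.4, 3.0, 0.21, 3.0])
    return np.asmatrix(np.asarray(observation) / scale)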
I have a given dataset in the matrix y and I want to train different SOMs with it. The SOM is one-dimensional (a line), and its number of neurons varies. I train a SOM of size N=2 at first, and N=NMax at last, giving a total of NMax-2+1 SOMs. For each SOM, I want to store the weights once the training is over before moving on to the next SOM.
The whole point of using PyOpenCL here is that each one of the outer loops is independent of the others. Namely, for each value of N, the script doesn't care about what happens when N takes other values. One could get the same result by running the script NMax-2+1 times, changing the value of N manually each time.
With this in mind, I was hoping to perform each of these independent iterations at the same time on the GPU, so that the total time is reduced significantly. The speed-up will be less than a factor of NMax-2+1 though, because each iteration is more expensive than the previous one: for larger values of N, more calculations are made.
Is there a way to 'translate' this code to run on the GPU? I've never used OpenCL before, so let me know if this is too broad or silly so I can ask a more specific question. The code is self-contained, so feel free to try it out. The four constants declared at the beginning can be changed to whatever you like (given that NMax > 1 and all the others are strictly positive).
import numpy as np
import time

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 3 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

t_begin = time.clock() # Start time
for N in range(NMax-1): # Number of neurons for this iteration
    w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N+2)))
            dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
        print(N+2,iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk,] - w[N])**2,axis=-1)
        ii = np.argmin(aux) # Neuron 'ii' is the winner
        wClusters[kk,N] = ii + 1
t_end = time.clock() # End time
#%%
print(t_end - t_begin)
I'm trying to give a somewhat complete answer.
First of all:
Can this code be adapted to be run on the GPU using (py)OpenCL?
Most probably yes.
Can this been done automatically?
No (afaik).
Most of the questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speed gain?" You state that your outer loop is independent of the results of the other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL work item would execute the same code with slightly different input parameters. Ignoring the overhead of data transfer between host and device, the running time of this approach would equal the running time of the slowest iteration. Depending on the iterations in your outer loop, this could be a massive speed gain. As long as the numbers stay relatively small, you could try the multiprocessing module in Python to parallelize these iterations on the CPU instead of the GPU.
Porting to the GPU usually only makes sense, if a huge number of processes are to be run in parallel (about 1000 or more). So in your case, if you really want an enormous speed boost, see if you can parallelize all calculations inside the loop. For example, you have 150 iterations and 2000 data points. If you could somehow parallelize these 2000 data points, this could offer a much bigger speed gain, which could justify the work of porting the whole code to OpenCL.
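As a small illustration of what parallelizing over the data points could look like, the final assignment loop in your script can already be written as a single vectorized NumPy expression (just a sketch of the idea, not a full batch SOM):

# squared distances of every data point to every neuron of the current SOM: shape (num_points, N+2)
dists2 = ((y[:, None, :] - w[N][None, :, :]) ** 2).sum(axis=-1)
wClusters[:, N] = np.argmin(dists2, axis=1) + 1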
TL;DR:
Try parallelizing on the CPU first. If you find you need to run more than several hundred processes at the same time, move to the GPU.
Update: Simple code for parallelizing on CPU using multiprocessing (without callback)
import numpy as np
import time
import multiprocessing as mp

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

def neuron_run(N):
    w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N+2)))
            dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
        print(N+2,iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk,] - w[N])**2,axis=-1)
        ii = np.argmin(aux) # Neuron 'ii' is the winner
        wClusters[kk,N] = ii + 1

t_begin = time.clock() # Start time
#%%
def apply_async():
    pool = mp.Pool(processes=NMax)
    for N in range(NMax-1):
        pool.apply_async(neuron_run, args = (N,))
    pool.close()
    pool.join()
    print "Multiprocessing done!"

if __name__ == '__main__':
    apply_async()

t_end = time.clock() # End time
print(t_end - t_begin)
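One caveat with this apply_async version: each worker runs in a separate process, so its writes to the global w and wClusters are not visible in the parent process. A variant that collects the results instead (a sketch, assuming neuron_run is changed to return its w[N] and its column of cluster assignments) could look like:

def apply_async_collect():
    pool = mp.Pool(processes=NMax)
    # assumes neuron_run has been modified to `return w_N, clusters_N`
    results = [pool.apply_async(neuron_run, args=(N,)) for N in range(NMax - 1)]
    pool.close()
    pool.join()
    for N, res in enumerate(results):
        w_N, clusters_N = res.get()
        w[N] = w_N
        wClusters[:, N] = clusters_N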
I am trying to implement gradient descent on a dataset. Even though I tried everything, I couldn't make it work, so I created a test case: I am trying my code on random data and debugging from there.
More specifically, I am generating random vectors with components between 0 and 1, together with random labels for these vectors, and trying to overfit the training data.
However, my weight vector gets bigger and bigger with each iteration and eventually blows up to infinity, so I do not actually learn anything. Here is my code:
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(n)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.01
    w = np.zeros(len(vectors[0])+1)
    for i in range(maxIterations):
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            deltaw += ( labels[i] - np.dot(w, temp) ) * temp
        w = w + ( stepSize * (-1 * deltaw) )
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print w
I am using LMS as the loss function. So, in every iteration, my update is
w^{i+1} = w^i - R \, \nabla E(w^i)
where w^i is the ith weight vector, R is the stepSize and E(w^i) is the loss function.
Here is my loss function (LMS):
E(w) = \frac{1}{2} \sum_j \left( y_j - w \cdot x_j \right)^2
and here is how I derived the gradient of the loss function:
\nabla E(w) = -\sum_j \left( y_j - w \cdot x_j \right) x_j
Now, my questions are:
Should I expect good results in this random scenario using gradient descent? (What are the theoretical bounds?)
If yes, what is the bug in my implementation?
PS: I tried several other maxIterations and stepSize parameters. Still not working.
PS2: This is the best way I could think to ask the question. Sorry if it is too specific, but it has been driving me crazy and I really want to understand the problem.
Your code has a couple of faults:
In the getVectors() method, you do not actually use the input variable m;
In the GDLearn() method, you have a double loop but use the same variable i as the loop variable in both loops (I guess the logic is still right, but it's confusing).
The prediction error (labels[i] - np.dot(w, temp)) has the wrong sign.
Step size does matter. Using 0.01 as the step size, the cost increases at each iteration; changing it to 0.001 solved the problem.
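To double-check the corrected sign of the prediction error, a quick finite-difference comparison on a single random example (just a sketch):

import numpy as np

np.random.seed(1)
w = np.random.rand(4)
x = np.random.rand(4)
y = 1.0

loss = lambda w: 0.5 * (y - np.dot(w, x)) ** 2
analytic = (np.dot(w, x) - y) * x   # gradient with the corrected sign
eps = 1e-6
numeric = np.array([(loss(w + eps * np.eye(4)[j]) - loss(w - eps * np.eye(4)[j])) / (2 * eps)
                    for j in range(4)])
print(np.allclose(analytic, numeric))   # True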
Here is my revised code based on your original code.
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(m)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.001
    w = np.zeros(len(vectors[0])+1)
    for iter in range(maxIterations):
        cost = 0
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            prediction_error = np.dot(w, temp) - labels[i]
            deltaw += prediction_error * temp
            cost += prediction_error**2
        w = w - stepSize * deltaw
        print 'cost at', iter, '=', cost
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print w
Running result -- you can see the cost is decreasing with each iteration but with a diminishing return.
cost at 0 = 100.0
cost at 1 = 99.4114482617
cost at 2 = 98.8476022685
cost at 3 = 98.2977744556
cost at 4 = 97.7612851154
cost at 5 = 97.2377571222
cost at 6 = 96.7268325883
cost at 7 = 96.2281642899
cost at 8 = 95.7414151147
cost at 9 = 95.2662577529
cost at 10 = 94.8023744037
......
cost at 90 = 77.367904046
cost at 91 = 77.2744249433
cost at 92 = 77.1823702888
cost at 93 = 77.0917090883
cost at 94 = 77.0024111475
cost at 95 = 76.9144470493
cost at 96 = 76.8277881325
cost at 97 = 76.7424064707
cost at 98 = 76.6582748518
cost at 99 = 76.5753667579
[ 0.16232142 -0.2425511 0.35740632 0.22548442 0.03963853 0.19595213
0.20080207 -0.3921798 -0.0238925 0.13097533 -0.1148932 -0.10077534
0.00307595 -0.30111942 -0.17924479 -0.03838637 -0.23938181 0.1384443
0.22929163 -0.0132466 0.03325976 -0.31489526 0.17468025 0.01351012
-0.25926117 0.09444201 0.07637793 -0.05940019 0.20961315 0.08491858
0.07438357]