Matrix multiplication quickly overflows in backpropagation equations - python

I'm writing my own neural network implementation from scratch, but I'm having a problem: one of the terms grows exponentially as iterations advance, which makes the accuracy plateau almost immediately and keeps gradient descent from finding any good solution.
I derived the equations myself while studying the resources available online; I chose to write my own so I could understand them better:
$$\delta^L = (A^L - Y) \odot \sigma'(Z^L) \qquad (1)$$
where L is the last layer, $\odot$ is the Hadamard product, $A^L$ holds the values of the Lth layer after the activation function, and $Z^L$ holds the values before the activation function is applied.
$$\delta^i = \left((W^{i+1})^T \delta^{i+1}\right) \odot \sigma'(Z^i) \qquad (2)$$
where $W^{i+1}$ is the matrix of weights of the (i+1)th layer, with as many rows as neurons in the (i+1)th layer and as many columns as neurons in the ith layer.
$$\frac{\partial C}{\partial W^i} = \delta^i \, (A^{i-1})^T \qquad (3)$$
$$\frac{\partial C}{\partial b^i} = \delta^i \cdot \mathbf{1} \qquad (4)$$
I struggled a lot with these, but I finally managed to make them work with consistent dimensions for every matrix. Implementing these equations in my code:
def back_prop(self):
    delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
    for i, layer in reversed(list(enumerate(self.layers))):
        dw = np.matmul(delta, self.a_record[i].T)
        db = np.matmul(delta, np.ones((self.n, 1)))
        self.dw_record.append(dw)
        self.db_record.append(db)
        if i > 0:
            delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))
a_record and z_record are lists that store the arrays corresponding to each layer. Same for dw and db records. I wrote it like this so it is as close as possible to my equations.
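To make the mapping between the equations and the records concrete, here is the shape bookkeeping they imply (a sketch for a single hidden layer of size h, an output layer of size n_out, n_in input features, and n samples, using the column-per-sample layout from the full class further down):

# a_record[0] : (n_in,  n)   input activations (self.input is stored transposed)
# z_record[0] : (h,     n)   pre-activations of the hidden layer
# a_record[1] : (h,     n)   hidden-layer activations
# z_record[1] : (n_out, n)   pre-activations of the output layer
# a_record[2] : (n_out, n)   network output
# delta       : (n_out, n)   at the output layer, then (h, n) one step back
# dw          : (n_out, h)   same shape as w[-1];  db : (n_out, 1)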
The problem comes once I start measuring performance on any dataset: it stays stagnant after just a couple of iterations. I also get this warning; since it's not an error, it doesn't stop the script.
RuntimeWarning: overflow encountered in multiply
  delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
When I analyze the values coming out of delta, I find that it grows far too fast: it starts small, but reaches magnitudes around e+200 and then turns into infinity, which makes the next delta a bunch of np.nan and stops all the weights and biases from changing.
I'm fairly certain the equations are equivalent to the ones I've seen in other neural network write-ups, so I don't really know why this is happening. I wouldn't expect the values of any randomly generated dataset to be so huge that they overflow after just a couple of iterations.
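A small debugging sketch like the following (assuming `net` is an instance of the Network class posted below) shows where the blow-up starts by logging the largest magnitude of the per-layer gradients after each back_prop call:

import numpy as np

def log_last_gradients(net, iteration):
    # back_prop appends one dw per layer each call, output layer first,
    # so the most recent len(net.layers) entries are this iteration's gradients
    latest = net.dw_record[-len(net.layers):]
    for layer_idx, dw in enumerate(reversed(latest)):
        print(f"iter {iteration}: layer {layer_idx} max|dw| = {np.abs(dw).max():.3e}")

Calling this right after self.back_prop() inside fit() makes it easy to see whether the explosion starts in the output layer or deeper in the network.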
I will post the whole class I made:
class Network():
    def __init__(self, input, target, layers, alpha=0.1, iter=1000):
        self.input = np.array(input).T
        self.y_true = target
        #self.target = np.array(one_hot(target)).T
        self.target = np.array(target).T
        self.layers = layers
        self.alpha = alpha
        self.iter = iter
        self.n = len(input)
        self.w = []
        self.b = []
        for i, layer in enumerate(layers):
            if i == 0:
                self.w.append(np.random.rand(layer.neurons, len(self.input)) - 0.5)
            else:
                self.w.append(np.random.rand(layer.neurons, layers[i - 1].neurons) - 0.5)
            self.b.append(np.random.rand(layer.neurons, 1) - 0.5)
        self.a_record = []
        self.z_record = []
        self.dw_record = []
        self.db_record = []

    def forward_prop(self):
        a = self.input
        self.a_record.append(a)
        for i, layer in enumerate(self.layers):
            z = np.matmul(self.w[i], a) + self.b[i].reshape(-1, 1)
            a = layer.func(z)
            self.z_record.append(z)
            self.a_record.append(a)
        self.output = a

    def back_prop(self):
        delta = np.multiply(self.a_record[-1] - self.target, self.layers[-1].prime(self.z_record[-1]))
        print(delta)
        for i, layer in reversed(list(enumerate(self.layers))):
            dw = np.matmul(delta, self.a_record[i].T)
            db = np.matmul(delta, np.ones((self.n, 1)))
            self.dw_record.append(dw)
            self.db_record.append(db)
            print(self.a_record[i].T)
            print("------------------------------------------")
            if i > 0:
                delta = np.multiply(np.matmul(self.w[i].T, delta), layer.prime(self.z_record[i - 1]))

    def gradient_desc(self):
        w_copy = self.w
        for i, (w, dw) in enumerate(zip(w_copy, self.dw_record[::-1])):
            self.w[i] = w - self.alpha*dw
        b_copy = self.b
        for i, (b, db) in enumerate(zip(b_copy, self.db_record[::-1])):
            self.b[i] = b - self.alpha*db

    def fit(self):
        for i in range(self.iter):
            self.forward_prop()
            self.back_prop()
            self.gradient_desc()
            if i % 40 == 0:
                print(f"Iteration: {i + 1} / {self.iter} =====================================")
                print(f"Accuracy: {get_accuracy(self.output, self.y_true)}")
I've been trying for a couple of days now to figure this out, but I can't find the error, or even a workaround to prevent the overflow. Maybe the equations are wrong from the start? Any help would be greatly appreciated; thank you in advance.

Related

GPU Optimization of k-armed Bandit Problem

I have the following code:
class k_armed_bandit:
    def __init__(self, epsilon, k, steps):
        self.batches = 4000
        self.epsilon = epsilon
        self.k = k
        self.steps = steps
        self.mean = 4 * torch.rand((self.batches, self.k), device=device) - 2
        self.var = torch.rand((self.batches, self.k), device=device) + 0.5
        self.estimates = 4 * torch.ones((self.batches, self.k), device=device)
        self.counts = torch.ones((self.batches, self.k), device=device)

    def run(self):
        rewards = torch.zeros(self.steps, device=device)
        for i in range(self.steps):
            pos = torch.where(
                torch.rand(self.batches, device=device) < self.epsilon,
                torch.randint(0, self.k, size=(self.batches, ), device=device),
                torch.argmax(self.estimates, dim=1)
            )
            pos_mask = F.one_hot(pos, num_classes=self.k)
            val = torch.normal(mean=torch.sum(pos_mask * self.mean, dim=1), std=torch.sum(pos_mask * self.var, dim=1))
            self.counts += pos_mask
            val = val[:, None] * pos_mask
            self.estimates += ((val - (self.estimates * pos_mask)) / self.counts) * pos_mask
            rewards[i] = torch.sum(val) / self.batches
        return rewards
If you're not familiar with the problem, the idea is that you have k options in front of you, and you are trying to maximize the sum of rewards received from the options across some number of turns. The reward from option i is sampled from N(mean[i], var[i]). You keep a running estimate of the return from each option (stored in self.estimates, which is updated in streaming fashion). With probability epsilon you pick a random option, and with probability 1 - epsilon you simply pick the option with the best estimated return (the position selected is stored in pos). At the end, you return the sequence of rewards obtained from the options you selected.
Anyway, the code runs pretty fast, but I suspect that some of my naive mistakes make it slower than it should be. First of all, my torch.where requires that, for every agent in the batch, both a random position and the argmax are computed, even though only one is used. This can't be good. Next, pos_mask is a big problem. It allows me to update the averages each agent keeps track of only if that option was picked as the sampled position, and it also lets me update the number of times each option was selected. The issue is that pos_mask, and all related computations, are much larger than they need to be, since it is a mask and I cannot seem to update the relevant indices in parallel. I tried to fiddle with torch's index-related methods, but none of them seem to help.
Any clues how to speed this up? Or maybe this problem is just not nice to implement on GPU?
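For what it's worth, the index-based updates the question is asking about would look roughly like this with torch.gather and scatter_ (a sketch of the body of the run() loop, assuming the same attributes as the class above; not benchmarked, and not claimed to be faster):

idx = pos.unsqueeze(1)                                    # (batches, 1), long
mu = torch.gather(self.mean, 1, idx).squeeze(1)           # mean of the chosen arm
sd = torch.gather(self.var, 1, idx).squeeze(1)            # std of the chosen arm
val = torch.normal(mean=mu, std=sd)                       # (batches,)

# bump the pull count of the chosen arm only
self.counts.scatter_add_(1, idx, torch.ones_like(idx, dtype=self.counts.dtype))

# incremental mean update, again only at the chosen index
old = torch.gather(self.estimates, 1, idx).squeeze(1)
cnt = torch.gather(self.counts, 1, idx).squeeze(1)
self.estimates.scatter_(1, idx, (old + (val - old) / cnt).unsqueeze(1))

rewards[i] = val.mean()

Whether this actually wins depends on k and the batch size: gather/scatter avoid materializing the (batches, k) one-hot mask, but for small k the masked version may be just as quick.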

Neural Network Backpropagation code not working

I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0 or 1.
I worked out the equations for backpropagation and found that, despite the network being so simple, my code does not converge to classifying the XOR data correctly.
Let W be the 3x3 matrix of weights connecting the input and hidden layer, and w be the 1x3 matrix that connects the hidden to output layer. Here are some helper functions for my method
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
this just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the mean squared error for a bunch of given data points. Both of these functions should work, as I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while(i < epochs):
        prevobj = obj
        dellw = np.zeros((1,3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s,u,v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
            dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and derivatives I took, does not converge to anything. In fact the error consistently bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe the problem lies with the W matrix gradient, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed with np.random.seed(0) just so that the matrices I'm dealing with stay consistent between runs.
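For anyone who wants to reproduce this, the following setup (a sketch assembled from the data and seed above, using the functions defined earlier) should be enough to run it:

import numpy as np

np.random.seed(0)  # the seed mentioned above

# XOR-style data from the table above: the last column is the always-1 bias input
traindata = np.array([[0, 0, 1],
                      [1, 1, 1],
                      [1, 0, 1],
                      [0, 1, 1]], dtype=float)
trainlabels = np.array([0, 0, 1, 1], dtype=float)

W, w = back_prop(traindata, trainlabels)
print(calculate_objective(traindata, trainlabels, W, w))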
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so many loops and inconsistencies. In general, it defeats the purpose of using np.array() if you are looping over elements. Likewise, you should know that np.matmul() is the @ operator (and np.dot() behaves the same way for 2-D arrays). It is unclear how you are using the derivative: you state it explicitly at the start for the activation function and then partially re-derive it in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, and they should be defined all at once on top of your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) solves for a fixed set of weights given a random variable, stochastic gradient descent does the opposite: we fix the random variable to our data and optimize the weights. Hence, your functions should have only the weights as "free variables"; everything else is fixed. It is important to track what is fixed and what is free to update, and your code does not reflect that distinction.
SGD algorithm outline:
Random params.
Update the params by moving them a small step in the direction of steepest descent.
Run step (2) for a specified amount of time.
Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data):
import numpy as np

#Data
X = np.random.random((100,)) #Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,)) #Linear model + Noise

#Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p : p[0]*X+p[1]
dF = lambda p : np.array([X,np.ones(X.shape)])
MSE = lambda p : (1/Y.shape[0])*((Y-F(p))**2).sum(0)
dMSE = lambda p : (1/Y.shape[0])*(-2*(Y-F(p))*dF(p)).sum(1)

#SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0,0.0])
for i in range(epochs):
    params -= lr*dMSE(params)
print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient is occurring and exactly how it is calculated. Note also, in case it wasn't clear, the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module. Hence, SGD is a clearly useless way to optimize two variables.
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
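The same least-squares fit can also be checked with plain NumPy (a sketch; np.linalg.lstsq solves exactly the problem the SGD loop above approximates):

A = np.stack([X, np.ones_like(X)], axis=1)       # design matrix [x, 1]
slope, intercept = np.linalg.lstsq(A, Y, rcond=None)[0]
print(slope, intercept)                          # should land close to 2.3 and 8.05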
I think I figured it out: in my code I was not summing the hidden node weight derivatives, and instead was assigning them at every loop iteration. The correct version would be as follows:
# the accumulators have to start at zero each epoch
dells = np.zeros(len(traindata[0]))
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s,u,v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x

Loss function increasing instead of decreasing

I have been trying to make my own neural network from scratch. After some time I made it, but I've run into a problem I cannot solve. I have been following a tutorial which shows how to do this. The problem is in how my network updates the weights and biases. I know that gradient descent won't always decrease the loss, and for a few epochs it might even increase a bit, but it should still decrease overall and work much better than mine does. Sometimes the whole process gets stuck at a loss of around 9 or 13 and cannot get out of it. I have checked many tutorials, videos, and websites, but I couldn't find anything wrong in my code.
self.activate, self.dactivate, self.loss and self.dloss:
# sigmoid
self.activate = lambda x: np.divide(1, 1 + np.exp(-x))
self.dactivate = lambda x: np.multiply(self.activate(x), (1 - self.activate(x)))
# relu
self.activate = lambda x: np.where(x > 0, x, 0)
self.dactivate = lambda x: np.where(x > 0, 1, 0)
# loss I use (cross-entropy)
clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10) # it's used to squeeze x into a probability between 0 and 1 (which I think is required)
self.loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x))))/y.shape[0])
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
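As a quick sanity check on definitions like these, a finite-difference comparison (a small sketch that just restates the sigmoid pair from above) confirms that dactivate really is the derivative of activate:

import numpy as np

# sigmoid pair copied from above
activate = lambda x: np.divide(1, 1 + np.exp(-x))
dactivate = lambda x: np.multiply(activate(x), 1 - activate(x))

x = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (activate(x + eps) - activate(x - eps)) / (2 * eps)
print(np.max(np.abs(numeric - dactivate(x))))  # should be tiny, around 1e-10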
The code I use for forward propagation:
self.activate(np.dot(X, self.weights) + self.biases) # it's an example for first hidden layer
And that's the code for backpropagation:
First part, in DenseNeuralNetwork class:
last_derivative = self.dloss(output, y)
for layer in reversed(self.layers):
    last_derivative = layer.backward(last_derivative, self.lr)
And the second part, in Dense class:
def backward(self, last_derivative, lr):
    w = self.weights
    dfunction = self.dactivate(last_derivative)
    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)
    return np.dot(dfunction, w.T)
I have also made a repl so you can check the whole code and run it without any problems.
1.
line 12
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
If you're going to clip x, you should clip y too. There are several ways to implement this, but if you are going to do it this way,
change to
self.dloss = lambda x, y: -(np.divide(clip(y), clip(x)) - np.divide(1 - clip(y), 1 - clip(x)))
2.
line 75
dfunction = self.dactivate(last_derivative)
this back propagation part is just wrong.
change to
dfunction = last_derivative*self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)
3.
line 77
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)
last_derivative should be dfunction. I think this is just a mistake.
change to
d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)
4.
line 85
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
Not sure where you are going with this, but I think the initialized values are too big. We're not doing precise hypertuning, so I just made it small.
self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
All good now
After this I changed the learning rate to 0.01 because it was too slow, and it worked fine.
I think you are misunderstanding backpropagation; you should probably double-check how it works. The other parts are OK, I think.
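Putting fixes 2 and 3 together, the backward method would look roughly like this (a sketch assembled from the snippets above, keeping the question's normalization by layer_input.shape[1]):

def backward(self, last_derivative, lr):
    w = self.weights
    # fix 2: chain the incoming gradient through the activation derivative of the pre-activation
    z = np.dot(self.layer_input, self.weights) + self.biases
    dfunction = last_derivative * self.dactivate(z)
    # fix 3: build the bias gradient from dfunction, not last_derivative
    d_w = np.dot(self.layer_input.T, dfunction) * (1. / self.layer_input.shape[1])
    d_b = (1. / self.layer_input.shape[1]) * np.dot(
        np.ones((self.biases.shape[0], dfunction.shape[0])), dfunction)
    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)
    return np.dot(dfunction, w.T)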
This can be caused by your training data: either it is too small or it has too many diverse labels (that's what I gather from the code at the link you shared).
I re-ran your code several times and it produced different training behavior each time. Sometimes the loss keeps decreasing until the last epoch, sometimes it keeps increasing, and in one run it decreased to some point and then started increasing again (with a minimum loss of 0.5).
I think it is your training data that matters this time. The learning rate is good enough, though (assuming you did the calculations for the linear combination, backpropagation, etc. correctly).

Cartpole - Simple backprop with 1 hidden layer?

I'm trying to solve the CartPole-v1 problem from OpenAI by using backprop on a one-hidden-layer neural network, while updating the model at every time step using state-action values Q(s,a). I'm unable to get the average reward to go beyond about 42 steps per episode. Could anyone help? Is my approach even correct? As in, is it even possible for the agent to learn the optimal solution if I'm updating the Q-values every time step instead of doing batch updates every episode? It seems like it should be possible in theory.
Details: after playing around and experimenting with activation functions and stochastic policies, and finally settling on a deterministic policy with a linear activation function and the parameters mentioned below, I'm able to get my agent to consistently converge (in about 100-300 steps) to an average reward of about 42 steps, but it doesn't go beyond 45. Adjusting the parameters (epsilon, discount_rate, and learning rate) in the program below does not have a big impact on this.
I've tried looking for a similar solution online but none of them seem to fit the approach that I'm following. Almost all of the solutions involve learning at the end of each episode (by storing SARS' data).
Increasing the number of hidden layers doesn't help either. I also think it is unlikely that the algorithm will converge to a better value in the future, as I've run it for 10,000+ episodes and its average reward is still around 40.
First, the hyperparameters:
epsilon = 0.5
lr = 0.05
discount_rate=0.9
# number of features in environment observations
num_inputs = 4
hidden_layer_nodes = 6
num_outputs = 2
The q function:
def calculateNNOutput(observation, m1, m2):
    scaled_observation = scaleFeatures(observation)
    hidden_layer = np.dot(scaled_observation, m1) # 1x4 X 4x6 -> 1x6
    outputs = np.dot(hidden_layer, m2) # 1x6 X 6x2
    return np.asmatrix(outputs) # 1x2
Action selection (policy):
def selectAction(observation):
    #explore
    global epsilon
    if random.uniform(0,1) < epsilon:
        return random.randint(0,1)
    #exploit
    outputs = calculateNNOutputs(observation)
    print(outputs)
    if (outputs[0,0] > outputs[0,1]):
        return 0
    else:
        return 1
Backprop:
def backProp(prev_obs, m1, m2, experimental_values):
    global lr
    scaled_observation = np.asmatrix(scaleFeatures(prev_obs))
    hidden_layer = np.asmatrix(np.dot(scaled_observation, m1)) #
    outputs = np.asmatrix(np.dot(hidden_layer, m2)) # 1x6 X 6x2
    delta_out = np.asmatrix((outputs-experimental_values)) # 1x2
    delta_2 = np.transpose(np.dot(m2,np.transpose(delta_out))) # 6x2 X 2x1 = 6x1_T = 1x6
    GRADIENT_2 = (np.transpose(hidden_layer))*delta_out # 6x1 X 1x2 = 6x2 - same as w2
    GRADIENT_1 = np.multiply(np.transpose(scaled_observation), delta_2) # 4 x 6 - same as w1
    m1 = m1 - lr*GRADIENT_1
    m2 = m2 - lr*GRADIENT_2
    return m1, m2
Q-learning:
def updateWeights(prev_obs, action, obs, reward, done):
    global weights_1, weights_2
    calculated_value = calculateNNOutputs(prev_obs)
    if done:
        experimental_value = -1
    else:
        actionValues = calculateNNOutputs(obs) # 1x2
        experimental_value = reward + discount_rate*(np.amax(actionValues, axis = 1)[0,0])
    if action == 0:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[experimental_value, calculated_value[0,1]]]))
    else:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[calculated_value[0,0], experimental_value]]))
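For reference, the target being regressed toward here is the standard one-step Q-learning target (with the terminal case pinned to -1, as in the code above):

$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$$

and only the output corresponding to the action actually taken is moved toward $y_t$; the other output keeps its current value, which is exactly what the two branches at the end of updateWeights do.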
EDIT: the main loop -
record = 0
total = 0
for i_episode in range(num_episodes):
    if (i_episode % 10 == 0):
        print("W1 = ", weights_1)
        print("W2 = ", weights_2)
    observation = env.reset()
    epsilon = max(epsilon*0.9, 0.01)
    lr = max(lr*0.9, 0.01)
    print("Average steps = ", total/(i_episode+1))
    print("Record = ", record)
    for t in range(1000):
        action_taken = selectAction(observation)
        print(action_taken)
        previous_observation = observation
        observation, reward, done, info = env.step(action_taken) # take the selected action
        updateWeights(previous_observation, action_taken, observation, reward, done) # perform backprop to update the action value
        if done:
            total = total + t
            if t > record:
                record = t
            print("Episode {} finished after {} timesteps".format(i_episode, t+1))
            break
Do I need to make any changes in approach/implementation/parameter tuning?

Gradient descent with random input implementation

I am trying to implement gradient descent on a dataset. Even though I tried everything, I couldn't make it work, so I created a test case: I run my code on random data and try to debug.
More specifically, I am generating random vectors with entries between 0 and 1 and random labels for these vectors, and trying to overfit the training data.
However, my weight vector gets bigger and bigger with each iteration, and then I get infinities, so I do not actually learn anything. Here is my code:
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(n)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.01
    w = np.zeros(len(vectors[0])+1)
    for i in range(maxIterations):
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            deltaw += ( labels[i] - np.dot(w, temp) ) * temp
        w = w + ( stepSize * (-1 * deltaw) )
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print(w)
I am using LMS as the loss function. So, at every iteration, my update is

$$w^{i+1} = w^i - R \,\nabla E(w^i)$$

where $w^i$ is the ith weight vector, $R$ is the stepSize, and $E(w^i)$ is the loss function. Here is my loss function (LMS):

$$E(w) = \frac{1}{2}\sum_j \left(y_j - w \cdot x_j\right)^2$$

and here is how I derived the gradient of the loss function:

$$\nabla E(w) = -\sum_j \left(y_j - w \cdot x_j\right) x_j$$
Now, my questions are:
Should I expect good results in this random scenario using gradient descent? (What are the theoretical bounds?)
If yes, what is my bug in my implementation?
PS: I tried several other maxIterations and stepSize parameters. Still not working.
PS2: This is the best way I could ask the question here. Sorry if it is too specific, but it has been driving me crazy and I really want to understand the problem.
Your code has a couple of faults:
In the getVectors() method, you did not actually use the input variable m;
In the GDLearn() method, you have a double loop, but you use the same variable i as the loop variable in both loops (the logic is probably still right, but it's confusing).
The prediction error (labels[i] - np.dot(w, temp)) has the wrong sign.
Step size matters. Using 0.01 as the step size, the cost increases at each iteration; changing it to 0.001 solved the problem.
Here is my revised code based on your original code.
import numpy as np
import random

def getRandomVector(n):
    return np.random.uniform(0,1,n)

def getVectors(m, n):
    return [getRandomVector(n) for i in range(m)]

def getLabels(n):
    return [random.choice([-1,1]) for i in range(n)]

def GDLearn(vectors, labels):
    maxIterations = 100
    stepSize = 0.001
    w = np.zeros(len(vectors[0])+1)
    for iter in range(maxIterations):
        cost = 0
        deltaw = np.zeros(len(vectors[0])+1)
        for i in range(len(vectors)):
            temp = np.append(vectors[i], -1)
            prediction_error = np.dot(w, temp) - labels[i]
            deltaw += prediction_error * temp
            cost += prediction_error**2
        w = w - stepSize * deltaw
        print('cost at', iter, '=', cost)
    return w

vectors = getVectors(100, 30)
labels = getLabels(100)
w = GDLearn(vectors, labels)
print(w)
Running result -- you can see the cost is decreasing with each iteration but with a diminishing return.
cost at 0 = 100.0
cost at 1 = 99.4114482617
cost at 2 = 98.8476022685
cost at 3 = 98.2977744556
cost at 4 = 97.7612851154
cost at 5 = 97.2377571222
cost at 6 = 96.7268325883
cost at 7 = 96.2281642899
cost at 8 = 95.7414151147
cost at 9 = 95.2662577529
cost at 10 = 94.8023744037
......
cost at 90 = 77.367904046
cost at 91 = 77.2744249433
cost at 92 = 77.1823702888
cost at 93 = 77.0917090883
cost at 94 = 77.0024111475
cost at 95 = 76.9144470493
cost at 96 = 76.8277881325
cost at 97 = 76.7424064707
cost at 98 = 76.6582748518
cost at 99 = 76.5753667579
[ 0.16232142 -0.2425511 0.35740632 0.22548442 0.03963853 0.19595213
0.20080207 -0.3921798 -0.0238925 0.13097533 -0.1148932 -0.10077534
0.00307595 -0.30111942 -0.17924479 -0.03838637 -0.23938181 0.1384443
0.22929163 -0.0132466 0.03325976 -0.31489526 0.17468025 0.01351012
-0.25926117 0.09444201 0.07637793 -0.05940019 0.20961315 0.08491858
0.07438357]
