I'm trying to figure out just one line in the below snippet I got from here
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim = (0.5,4)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in range(60000):
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
The line I cannot figure out is:
layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
Specifically, why do we take the dot product with synapse_1 instead of layer_1?
By using synapse_1 in the delta calculation, the partial derivative is taken with respect to the weights instead of the layer_1 output, which is what we want, right?
I think this is what layer_1_delta should actually be:
layer_1_delta = layer_1.T.dot(layer_2_delta) * (layer_1 * (1-layer_1))
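One way to probe this is a numerical check. The sketch below is not from the original post. It assumes the loss is 0.5 * sum((layer_2 - y)**2), which is consistent with the (layer_2 - y) factor in layer_2_delta, and it fixes a random seed. It compares layer_2_delta.dot(synapse_1.T) against a finite-difference estimate of the loss gradient with respect to each entry of layer_1. Since the input to layer_2 is layer_1.dot(synapse_1), differentiating that product with respect to layer_1 leaves the weights synapse_1 as the factor, while differentiating with respect to synapse_1 leaves layer_1, which is the form used in the synapse_1 update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0, 1, 1, 0]], dtype=float).T
synapse_0 = 2 * rng.random((3, 4)) - 1
synapse_1 = 2 * rng.random((4, 1)) - 1

def loss_from_layer_1(layer_1):
    # Assumed loss: half the summed squared error of the network output.
    layer_2 = sigmoid(layer_1.dot(synapse_1))
    return 0.5 * np.sum((layer_2 - y) ** 2)

layer_1 = sigmoid(X.dot(synapse_0))
layer_2 = sigmoid(layer_1.dot(synapse_1))
layer_2_delta = (layer_2 - y) * (layer_2 * (1 - layer_2))

# Analytic gradient of the loss with respect to the layer_1 outputs.
analytic = layer_2_delta.dot(synapse_1.T)

# Central finite differences, one entry of layer_1 at a time.
eps = 1e-6
numeric = np.zeros_like(layer_1)
for i in range(layer_1.shape[0]):
    for k in range(layer_1.shape[1]):
        bump = np.zeros_like(layer_1)
        bump[i, k] = eps
        numeric[i, k] = (loss_from_layer_1(layer_1 + bump)
                         - loss_from_layer_1(layer_1 - bump)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(analytic.shape, max_abs_diff)
```

Note that layer_1.T.dot(layer_2_delta) has the shape of synapse_1, (hidden_dim, 1), so it is a gradient for the weights rather than an error term per sample and hidden unit.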
import numpy as np
from matplotlib import pyplot as plt
xk = np.linspace(-1,1,100)
yk= 2 * xk + 3 + np.random.rand(len(xk))
x1,x2 = np.meshgrid(xk,yk)
F = (x1 - 2) ** 2 + 2 * (x2 - 3) ** 2
fig=plt.figure()
surf = fig.add_subplot(1,1,1, projection='3d')
surf.plot_surface(x1,x2,F)
surf.contour(x1,x2,F)
fig, surf=plt.subplots()
plt.contour(x1, x2, F, 20)
m = 0
c = 0
learning_rate=0.01
I think the problem that keeps me from getting the correct result is in the code below, but I can't find where it is
for k in range(10):
    shuffel_index=np.random.permutation(len(xk))
    xk = xk[shuffel_index]
    yk = yk[shuffel_index]
    for i in range(len(xk)):
        grad_m = - 2 * xk[i] * (yk[i] - (np.dot(m,xk[i]) + c))
        grad_c = - 2 * (yk[i] - (np.dot(m,xk[i])+c))
        m = m - learning_rate * grad_m
        c = c - learning_rate * grad_c
        surf.plot(np.array([xk[0], yk[0]]),np.array([xk[1], yk[1]]),'ko-')
        if (k != 10 or i != len(xk)):
            surf.plot(np.array([xk[0], yk[0]]),np.array([xk[1], yk[1]]),'ko-')
plt.show()
This is my result for the above code.
I would like to get a result like the one I get with the gradient descent algorithm. Here is an example of my gradient descent result.
May I know where my error is?
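For comparison, here is a minimal sketch of stochastic gradient descent on the same kind of data that records every (m, c) pair it visits. It is an illustration under stated assumptions, not a confirmed fix for the code above: the fixed seed and the `path` list are additions that do not appear in the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
xk = np.linspace(-1, 1, 100)
yk = 2 * xk + 3 + rng.random(len(xk))

m, c = 0.0, 0.0
learning_rate = 0.01
path = [(m, c)]  # hypothetical helper: every parameter pair visited so far

for epoch in range(10):
    # Visit the samples in a fresh random order each epoch.
    for i in rng.permutation(len(xk)):
        residual = yk[i] - (m * xk[i] + c)
        grad_m = -2 * xk[i] * residual
        grad_c = -2 * residual
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
        path.append((m, c))

path = np.array(path)  # one row per update, columns are m and c
print(m, c)
```

The trajectory lives in parameter space, so to draw it one would plot path[:, 0] against path[:, 1], for example with plt.plot(path[:, 0], path[:, 1], 'ko-'), on axes whose coordinates are m and c rather than the data points xk and yk.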
When I run the following code
from tensorflow import keras
import numpy as np
x = np.ones((1,2,1))
model = keras.models.Sequential()
model.add(keras.layers.GRU(
    units = 1, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='ones',
    recurrent_initializer='ones',bias_initializer='zeros', return_sequences = True))
model.predict(x)
I get the output => array([[[0.20482421], [0.34675306]]], dtype=float32)
When I do this by hand I am getting 0.55
Assuming no biases and all weights are set to 1
hidden_(t-1) = 0
update_gate = sigmoid(1x1 + 1x0) = 0.73
relevance_gate = sigmoid(1x1 + 1x0) = 0.73
candidate_h(t) = tanh( 1 x (0 x 0.73) + 1 x 1) = tanh(1) = 0.76
h(t) = 0.73*0.76 + (1 - 0.73)x0 = 0.55
so shouldn't the first value of the output be 0.55?
You seem to have swapped the two terms of the hidden-state equation in your last line.
sigmoid(1 * 1 + 1 * 0) = 0.73105857863, tanh(1 * 1 + 1 * 0) = 0.761594155956
Ht = Zt ⊙ Ht-1 + (1 - Zt) ⊙ H~t
Since, Ht-1 = 0, this results in, Ht = (1 - Zt) ⊙ H~t
Following the GRU formula I got, h(t) = 0.73105857863 * 0 + (1 - 0.73105857863) x 0.761594155956 = 0.20482421480989209117972 which matches output 0.20482421.
For the next time step,
Rt = Sigmoid(1 * 1 + 1 * 0.20482421) = 0.769381871687
Zt = Sigmoid(1 * 1 + 1 * 0.20482421) = 0.769381871687
H~t = tanh(1 * 1 + 0.769381871687 * 0.20482421 * 1) = 0.8202522791
Ht = 0.769381871687 * 0.20482421 + (1 - 0.769381871687) * 0.8202522791 = 0.346753079407
This matches with final output of 0.34675306.
References:
https://d2l.ai/chapter_recurrent-modern/gru.html#hidden-state
https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
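The hand calculation above can be reproduced with a few lines of NumPy. This is a sketch of a scalar GRU step under the same assumptions as the question (one unit, all weights equal to 1, zero biases, initial hidden state 0). The function name gru_step and its parameters w and u are made up for illustration and are not part of Keras.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, w=1.0, u=1.0):
    """One scalar GRU step with input weight w, recurrent weight u, no bias."""
    z = sigmoid(w * x_t + u * h_prev)             # update gate
    r = sigmoid(w * x_t + u * h_prev)             # reset (relevance) gate
    h_cand = np.tanh(w * x_t + r * (u * h_prev))  # candidate hidden state
    # The update gate weights the previous state, (1 - z) the candidate.
    return z * h_prev + (1.0 - z) * h_cand

h = 0.0
outputs = []
for x_t in [1.0, 1.0]:  # the two time steps of x = np.ones((1, 2, 1))
    h = gru_step(x_t, h)
    outputs.append(float(h))

print(outputs)
```

The two values agree with the model output 0.20482421 and 0.34675306 up to float32 rounding.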
I have some example code. When I calculate dloss/dw manually I get 8, but the following code gives me 16. Please explain how the gradient comes out as 16.
import torch
x = torch.tensor(2.0)
y = torch.tensor(2.0)
w = torch.tensor(3.0, requires_grad=True)
# forward
y_hat = w * x
s = y_hat - y
loss = s**2
#backward
loss.backward()
print(w.grad)
I think you simply miscalculated.
The derivative of loss = (w * x - y) ^ 2 is:
dloss/dw = 2 * (w * x - y) * x = 2 * (3 * 2 - 2) * 2 = 16
Keep in mind that back-propagation in neural networks is done by applying the chain rule. I think you forgot the * x at the end of the derivative.
To be specific:
The chain rule says that df(g(x))/dx = f'(g(x)) * g'(x) (differentiated with respect to x).
the whole loss function in your case is built like this:
loss(y_hat) = (y_hat - y)^2
y_hat(x) = w * x
thus: loss(y_hat(x)) = (y_hat(x) - y)^2
Its derivative, according to the chain rule, is:
dloss(y_hat(x))/dw = loss'(y_hat(x)) * dy_hat(x)/dw
for any z:
loss'(z) = 2 * (z - y) * 1 and dy_hat(z)/dw = z
thus: dloss(y_hat(x))/dw = loss'(y_hat(x)) * dy_hat(x)/dw = 2 * (y_hat(x) - y) * x = 2 * (w * x - y) * x = 2 * (3 * 2 - 2) * 2 = 16
PyTorch knows that in your forward pass each layer applies some function to its input and that your forward pass is 1 * loss(y_hat(x)). It then keeps applying the chain rule during the backward pass (each layer requires one application of the chain rule).
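The same number can be checked without PyTorch. The sketch below uses plain Python and a central finite difference. The step size eps and the variable missing_x are choices made for this illustration only.

```python
def loss(w, x=2.0, y=2.0):
    # Same loss as in the question: (w * x - y) ** 2
    return (w * x - y) ** 2

w, x, y = 3.0, 2.0, 2.0

# Chain rule: outer derivative 2 * (w * x - y) times inner derivative x.
analytic = 2 * (w * x - y) * x

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

# Dropping the inner derivative x gives the value computed by hand in the question.
missing_x = 2 * (w * x - y)

print(analytic, numeric, missing_x)
```

Here analytic is 16.0 and missing_x is 8.0, so the gap between the manual result and w.grad is exactly the factor x.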
I have made a small neural network that takes two inputs, x and y, and outputs z, a 1 or a 0. There is one hidden layer with two neurons, h1 and h2, and one neuron in the output layer. The inputs are height and width, integers between 0 and 100, and the classes are big and small (e.g. 10,8 is 'small' and 76,92 is 'big'). There are linear and non-linear data types. I use the sigmoid activation function and back-propagate using partial derivatives with respect to the weights and biases. I'm not using any ML libraries; I'm trying to code most of the maths directly. I cannot get it to work. Perhaps I have made a mistake in the backpropagation algorithm, as this was the most challenging part. I am hoping someone can point out what I've done wrong. Below is the code:
import random, numpy, math
lr = 1 #learning rate
dt = '1' #data type
epochs = 100000
tda = 50 #training data amount
def step(x): #step function
    if x > 0:
        x = 1
    else:
        x = 0
    return x
def error(truth, output):
    return 0.5 * (truth - output)**2
def sig(x): #sigmoid activation
    return 1/(1+numpy.exp(-x))
#weights
w = [random.random(),random.random(),random.random(),random.random(),random.random(),random.random()]
#biases
b = [random.random(),random.random(),random.random()]
def Net(x, y, t) : # t is truth (or target)
    h1 = x*w[0]+y*w[1]+b[0] #summation in h1, first neuron in hidden layer
    h1out = sig(h1) #sigmoid activation
    h2 = x*w[2]+y*w[3]+b[1]
    h2out = sig(h2)
    z = h1out*w[4]+h2out*w[5]+b[2] #z is output neuron
    zout = sig(z)
    e = error(t, zout) # e is error
    #backpropagation, partial differentiations to find error at each weight and bias
    e5 = (zout-t) * (zout * (1 - zout)) * h1out #e5 is error at 5th weight etc
    e6 = (zout-t) * (zout * (1 - zout)) * h2out
    e1 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out)) * x
    e2 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out)) * y
    e3 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out)) * x
    e4 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out)) * y
    be3 = (zout-t) * (zout * (1 - zout)) #error at 3rd bias etc
    be1 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out))
    be2 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out))
    #print (e1, e2, e3, e4, e5, e6, be1, be2, be3)
    #updating weights and biases
    w[0] = w[0] - (e1 * lr)
    w[1] = w[1] - (e2 * lr)
    w[2] = w[2] - (e3 * lr)
    w[3] = w[3] - (e4 * lr)
    w[4] = w[4] - (e5 * lr)
    w[5] = w[5] - (e6 * lr)
    b[2] = b[2] - (be3 * lr)
    b[0] = b[0] - (be1 * lr)
    b[1] = b[1] - (be2 * lr)
Your backpropagation should look like this:
This pattern continues toward the shallower layers. Basically, you calculate the error of each neuron through the derivative of its activation function and then use that error to compute the error of the previous layer.
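The per-neuron error idea can be sketched for the 2-2-1 network from the question. This is a hedged illustration, not the answerer's original formula: the names delta_z, delta_h1, delta_h2, the function names, and the example weights and inputs are made up here. Each delta is a neuron's error after passing through the sigmoid derivative, and the hidden deltas reuse the output delta. A finite-difference check on w[0] is included.

```python
import math

def sig(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, y, w, b):
    h1out = sig(x * w[0] + y * w[1] + b[0])
    h2out = sig(x * w[2] + y * w[3] + b[1])
    zout = sig(h1out * w[4] + h2out * w[5] + b[2])
    return h1out, h2out, zout

def gradients(x, y, t, w, b):
    h1out, h2out, zout = forward(x, y, w, b)
    # Error of the output neuron through its sigmoid derivative.
    delta_z = (zout - t) * zout * (1 - zout)
    # Hidden errors: output error sent back through the connecting weight,
    # then through the hidden neuron's sigmoid derivative.
    delta_h1 = delta_z * w[4] * h1out * (1 - h1out)
    delta_h2 = delta_z * w[5] * h2out * (1 - h2out)
    grad_w = [delta_h1 * x, delta_h1 * y, delta_h2 * x, delta_h2 * y,
              delta_z * h1out, delta_z * h2out]
    grad_b = [delta_h1, delta_h2, delta_z]
    return grad_w, grad_b

# Arbitrary example values, chosen only for the check below.
w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
b = [0.05, -0.05, 0.1]
x, y, t = 0.5, 0.8, 1.0
grad_w, grad_b = gradients(x, y, t, w, b)

def error_at(w_try):
    # Same error as in the question: 0.5 * (truth - output) ** 2
    return 0.5 * (t - forward(x, y, w_try, b)[2]) ** 2

eps = 1e-6
w_plus, w_minus = list(w), list(w)
w_plus[0] += eps
w_minus[0] -= eps
numeric_w0 = (error_at(w_plus) - error_at(w_minus)) / (2 * eps)
print(grad_w[0], numeric_w0)
```

Computing each delta once and reusing it keeps the expressions for the earlier layers short, which makes them easier to compare against a numerical check like the one above.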
import numpy as np
from numpy import exp
from numpy import random
from numpy import log
from numpy import size
from numpy import amax
from numpy import multiply as mul
from numpy import vstack
from matplotlib import pyplot as plot
def sigmoid(x):
    return 1.0 / (1+ exp(-x) )
def backPro(X,y,alpha):
    a1=X.T
    n1=size(X,1)
    n2=10
    m=size(a1,1)
    a1 = vstack((np.ones((1,m)), a1))
    K=size(y,0)
    Jlist=[]
    s=100
    lit=0
    #initialization (randomize)
    theta1=random.rand(n2,n1+1)
    theta2=random.rand(K,n2+1)
    while lit<10000:
        z2 = theta1 * a1
        a2 = sigmoid(z2)
        a2 = vstack((np.ones((1,m)), a2))
        z3 = theta2 * a2
        H = sigmoid(z3)
        J = (-1.0/m) * np.sum(mul(y,log(H)) + mul(1-y,log(1-H)))
        Jlist.append(J)
        sigma3 = H-y
        sigma2 = mul( (theta2.T * sigma3), mul(a2 , (1-a2)) )
        delta1 = sigma2[1:] * a1.T * (1.0/m)
        delta2 = sigma3 * a2.T * (1.0/m)
        #do the gradient descent
        theta1 -= (delta1 * alpha)
        theta2 -= (delta2 * alpha)
        #update the s
        #s=max( amax(np.abs(delta1)),amax(np.abs(delta2)) )
        lit+=1
    plot.scatter(range(0,len(Jlist)),Jlist,1)
    plot.scatter(range(0,len(Jlist[-100:])), Jlist[-100:],1)
    print("The J is "+str(J))
    print("The S is "+str(s))
    print("lit is "+str(lit))
    return theta1,theta2
def cost(a1,y,h):
    m=size(a1,0)
    Kmatrix=np.sum( mul(y, log(h)) + mul((1-y), log(1-h)) )
    J= (-1.0/m)* Kmatrix
    return J
def predict(X, theta1, theta2):
    X = X.T
    m = size(X,1)
    X = vstack((np.ones((1,m)),X))
    hidden = sigmoid(theta1 * X)
    hidden = vstack((np.ones((1,m)),hidden))
    print(sigmoid(theta2 * hidden))
    return np.argmax( sigmoid(theta2 * hidden),0 ).T + 1
I want to use this multiclass classifier to classify handwritten digits that have already been converted to numbers (the original images are simply binary pixels: 1 means a pixel is pure black, otherwise it is pure white).
In practice, X stores the data as a numpy.matrix with dimensions n1 * m (rows * columns), where n1 is the number of input layer nodes and m is the number of data samples. y stores the labels (values from 1 to 9), also as a numpy.matrix, with dimensions K * m (K is the number of labels, which is 9 here, and m is again the number of data samples). By the way, I have disabled the calculation of s in backPro.
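To make the K * m label layout concrete, here is a small sketch of building such a matrix from integer labels and mapping it back with the same argmax(...) + 1 convention that predict() uses. The helper one_hot and the sample labels are hypothetical; the question does not show how y was actually built.

```python
import numpy as np

def one_hot(labels, K):
    """Return a K x m matrix with a single 1 per column, for labels in 1..K."""
    labels = np.asarray(labels).ravel()
    Y = np.zeros((K, labels.size))
    # Row index is label - 1, column index is the sample number.
    Y[labels - 1, np.arange(labels.size)] = 1.0
    return Y

labels = np.array([1, 3, 2, 9])  # made-up labels for four samples
Y = one_hot(labels, K=9)

# Recover the labels the same way predict() does: row of the maximum, plus 1.
recovered = np.argmax(Y, axis=0) + 1
print(Y.shape, recovered)
```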
When I pass X, y, and alpha (the learning rate) to backPro and get theta1 and theta2, I then use these two parameters to calculate the output on the training set (yes, just the training set). I find that the predictions for all the samples are the same, and I mean exactly the same, which I find incredible.
Here is the result:
from simpleneu import predict as pre
pre(X,theta1,theta2)
[[0.10106717 0.10106717 0.10106717 ... 0.10106717 0.10106717 0.10106717]
[0.10169492 0.10169492 0.10169492 ... 0.10169492 0.10169492 0.10169492]
[0.09981168 0.09981168 0.09981168 ... 0.09981168 0.09981168 0.09981168]
...
[0.09918393 0.09918393 0.09918393 ... 0.09918393 0.09918393 0.09918393]
[0.09730069 0.09730069 0.09730069 ... 0.09730069 0.09730069 0.09730069]
[0.09918393 0.09918393 0.09918393 ... 0.09918393 0.09918393 0.09918393]]
Out[99]:
matrix([[2],
[2],
[2],
...,
[2],
[2],
[2]])
The dimension of the output is K * m. In the predict() function I print sigmoid(theta2 * hidden) before converting it to labels by taking the class with the largest output.
And here is the graph of J, the cost, with respect to the number of iterations: