I have trained a Neural Net to solve the XOR problem. The problem with my network is that it is not converging. I am using Andrew Ng's methods and notations as taught in the DeepLearning.ai course.
Here's the code:
from __future__ import print_function
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0, 1, 1, 0]])

np.random.seed(1)
W1 = np.random.randn(3, 2) * 0.0001
b1 = np.ones((3, 1))
W2 = np.random.randn(1, 3) * 0.0001
b2 = np.ones((1, 1))
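(sigmoid and sigmoid_gradient are used below but not defined in the post; a typical definition, assumed here for completeness, would be:)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1 - s)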
The next part is the forward and backpropagation loop:
learning_rate = 0.01
m = 4
for iteration in range(100000):
    # forward propagation
    # layer 1
    Z1 = np.dot(W1, X.T) + b1
    A1 = sigmoid(Z1)
    # layer 2
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    # backpropagation
    dZ2 = Y - A2
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(dW2.T, dZ2) * sigmoid_gradient(Z1)
    dW1 = (1 / m) * np.dot(dZ1, X)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    # checking if shapes are correctly preserved
    assert (dZ2.shape == Z2.shape)
    assert (dW2.shape == W2.shape)
    assert (db2.shape == b2.shape)
    assert (dZ1.shape == Z1.shape)
    assert (dW1.shape == W1.shape)
    assert (db1.shape == b1.shape)

    # update parameters
    W1 = W1 + learning_rate * dW1
    W2 = W2 + learning_rate * dW2
    b1 = b1 + learning_rate * db1
    b2 = b2 + learning_rate * db2

    # print every 10k iterations
    if iteration % 10000 == 0:
        print(A2)
You have made a couple of mistakes in your code, for example in computing dW2:
...
dZ2 = Y - A2
dW2 = (1 / m) * np.dot(dZ2, A1.T)
...
W2 = W2 + learning_rate * dW2
We want to calculate the derivative of the cost with respect to W2 using the chain rule. We can write the derivative as follows:

dJ/dW2 = (dJ/dA2) * (dA2/dZ2) * (dZ2/dW2)

You haven't implemented the middle part, the derivative of A2 with respect to Z2 (i.e. the derivative of the sigmoid).
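For reference, here is a minimal sketch of how the backward pass could look with that middle factor included (this assumes a squared-error cost and reuses the sigmoid_gradient helper from the question; note it also uses W2.T rather than dW2.T and subtracts the gradients in the update):

dZ2 = (A2 - Y) * sigmoid_gradient(Z2)           # dJ/dA2 * dA2/dZ2
dW2 = (1 / m) * np.dot(dZ2, A1.T)               # * dZ2/dW2
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = np.dot(W2.T, dZ2) * sigmoid_gradient(Z1)  # propagate through W2 (not dW2) and the sigmoid
dW1 = (1 / m) * np.dot(dZ1, X)
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

W1 = W1 - learning_rate * dW1                   # gradient descent subtracts the gradients
b1 = b1 - learning_rate * db1
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2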
You can check out this video; it explains the math behind backpropagation. You can also check out this simple implementation of a neural network.
I am training a model to predict pose using a custom PyTorch model. However, V1 below never learns (the params don't change). The output is connected to the backprop graph and has grad_fn=MmBackward.
I can't understand why V1 isn't learning but V2 is.
V1
class cam_pose_transform_V1(torch.nn.Module):
    def __init__(self):
        super(cam_pose_transform_V1, self).__init__()
        self.elevation_x_rotation_radians = torch.nn.Parameter(torch.normal(0., 1e-6, size=()))
        self.azimuth_y_rotation_radians = torch.nn.Parameter(torch.normal(0., 1e-6, size=()))
        self.z_rotation_radians = torch.nn.Parameter(torch.normal(0., 1e-6, size=()))

    def forward(self, x):
        exp_i = torch.zeros((4, 4))

        c1 = torch.cos(self.elevation_x_rotation_radians)
        s1 = torch.sin(self.elevation_x_rotation_radians)
        c2 = torch.cos(self.azimuth_y_rotation_radians)
        s2 = torch.sin(self.azimuth_y_rotation_radians)
        c3 = torch.cos(self.z_rotation_radians)
        s3 = torch.sin(self.z_rotation_radians)

        rotation_in_matrix = torch.tensor([
            [c2, s2 * s3, c3 * s2],
            [s1 * s2, c1 * c3 - c2 * s1 * s3, -c1 * s3 - c2 * c3 * s1],
            [-c1 * s2, c3 * s1 + c1 * c2 * s3, c1 * c2 * c3 - s1 * s3]
        ], requires_grad=True)

        exp_i[:3, :3] = rotation_in_matrix
        exp_i[3, 3] = 1.

        return torch.matmul(exp_i, x)
However, this version learns as expected (params and loss change) and also has grad_fn=MmBackward on the output:
V2
def vec2ss_matrix(vector):  # vector to skew-symmetric matrix
    ss_matrix = torch.zeros((3, 3))
    ss_matrix[0, 1] = -vector[2]
    ss_matrix[0, 2] = vector[1]
    ss_matrix[1, 0] = vector[2]
    ss_matrix[1, 2] = -vector[0]
    ss_matrix[2, 0] = -vector[1]
    ss_matrix[2, 1] = vector[0]
    return ss_matrix

class cam_pose_transform_V2(torch.nn.Module):
    def __init__(self):
        super(cam_pose_transform_V2, self).__init__()
        self.w = torch.nn.Parameter(torch.normal(0., 1e-6, size=(3,)))
        self.v = torch.nn.Parameter(torch.normal(0., 1e-6, size=(3,)))
        self.theta = torch.nn.Parameter(torch.normal(0., 1e-6, size=()))

    def forward(self, x):
        exp_i = torch.zeros((4, 4))
        w_skewsym = vec2ss_matrix(self.w)
        v_skewsym = vec2ss_matrix(self.v)
        exp_i[:3, :3] = torch.eye(3) + torch.sin(self.theta) * w_skewsym + (1 - torch.cos(self.theta)) * torch.matmul(w_skewsym, w_skewsym)
        exp_i[:3, 3] = torch.matmul(torch.eye(3) * self.theta + (1 - torch.cos(self.theta)) * w_skewsym + (self.theta - torch.sin(self.theta)) * torch.matmul(w_skewsym, w_skewsym), self.v)
        exp_i[3, 3] = 1.

        return torch.matmul(exp_i, x)
Update #1
In the training loop I printed the .grad attributes using:
print([i.grad for i in list(cam_pose.parameters())])
loss.backward()
print([i.grad for i in list(cam_pose.parameters())])
Results:
# V1
[None, None, None]
[None, None, None]
# V2
[None, None, None]
[tensor([-0.0032, 0.0025, -0.0053]), tensor([ 0.0016, -0.0013, 0.0054]), tensor(-0.0559)]
Nothing else in the code was changed, just swapped V1 model for V2.
This is your problem right here:
rotation_in_matrix = torch.tensor([
    [c2, s2 * s3, c3 * s2],
    [s1 * s2, c1 * c3 - c2 * s1 * s3, -c1 * s3 - c2 * c3 * s1],
    [-c1 * s2, c3 * s1 + c1 * c2 * s3, c1 * c2 * c3 - s1 * s3]], requires_grad=True)
You are creating a tensor out of a list of tensors, which is not a differentiable operation, i.e. there is no gradient flow from rotation_in_matrix back to its elements c1..s3.
The solution is to build rotation_in_matrix using tensor operations like torch.stack and torch.cat instead; see the sketch below.
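For example, a minimal sketch of the fix, keeping the c1..s3 values exactly as computed in your forward:

row1 = torch.stack([c2, s2 * s3, c3 * s2])
row2 = torch.stack([s1 * s2, c1 * c3 - c2 * s1 * s3, -c1 * s3 - c2 * c3 * s1])
row3 = torch.stack([-c1 * s2, c3 * s1 + c1 * c2 * s3, c1 * c2 * c3 - s1 * s3])
rotation_in_matrix = torch.stack([row1, row2, row3])  # 3x3 with a grad_fn, so gradients reach the parameters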
I am trying to learn the linear equation y = x1 + x2 + e where e is a random error between 0 and 0.5.
The data is defined like this:
import random
import numpy as np

X1 = np.random.randint(1, 10000, 5000)
X2 = np.random.randint(1, 10000, 5000)
e = np.array([random.uniform(0, 0.5) for i in range(5000)])
y = X1 + X2 + e
When I implement a simple gradient descent to find the parameters, the loss and the gradients all explode. Where am I going wrong? The code for gradient descent:
w1, w2, b = 1, 1, 0
n = X1.shape[0]
alpha = 0.01
for i in range(5):
    y_pred = w1 * X1 + w2 * X2 + b
    L = np.sum(np.square(y - y_pred))/(2 * n)
    dL_dw1 = (-1/n) * np.sum((y - y_pred) * X1)
    dL_dw2 = (-1/n) * np.sum((y - y_pred) * X2)
    dL_db = (-1/n) * np.sum((y - y_pred))
    w1 = w1 - alpha * dL_dw1
    w2 = w2 - alpha * dL_dw2
    b = b - alpha * dL_db
    print(L, w1, w2, b)
The output for this is:
0.042928723015982384 , 13.7023102434034 , 13.670617201430483 , 0.00254938447277222
9291487188.8259 , -7353857.489486973 , -7293941.123714662 , -1261.9252592161051
3.096713445664372e+21 , 4247172241132.3584 , 4209117175658.749 , 728518135.2857293
1.0320897597938595e+33 , -2.4520737800716524e+18 , -2.4298158059267333e+18 , -420579738783719.2
3.4398058610314825e+44 , 1.415615899689713e+24 , 1.402742160404974e+24 , 2.428043942370682e+20
All you are missing is data normalization. For gradient-based learning algorithms you have to make sure the data is normalized, i.e. that it has mean 0 and std 1.
Let's verify this by using a constant error (say e = 33).
import numpy as np

X1 = np.random.randint(1, 10000, 5000)
X2 = np.random.randint(1, 10000, 5000)
e = 33

# Normalize data
X1 = (X1 - np.mean(X1))/np.std(X1)
X2 = (X2 - np.mean(X2))/np.std(X2)
y = X1 + X2 + e

w1, w2, b = np.random.rand(), np.random.rand(), np.random.rand()
n = X1.shape[0]
alpha = 0.01

for i in range(1000):
    y_pred = w1 * X1 + w2 * X2 + b
    L = np.sum(np.square(y - y_pred))/(2 * n)
    dL_dw1 = (-1/n) * np.sum((y - y_pred) * X1)
    dL_dw2 = (-1/n) * np.sum((y - y_pred) * X2)
    dL_db = (-1/n) * np.sum((y - y_pred))
    w1 = w1 - alpha * dL_dw1
    w2 = w2 - alpha * dL_dw2
    b = b - alpha * dL_db
    if i % 100 == 0:
        print("Loss:", L)

print(w1, w2, b)
Output:
Loss: 517.7575710514508
Loss: 69.36601211594098
Loss: 9.29326322560041
Loss: 1.2450619081931993
Loss: 0.16680720657514425
Loss: 0.022348057963833764
Loss: 0.002994096883392299
Loss: 0.0004011372165515275
Loss: 5.374289796164062e-05
Loss: 7.2002934167549005e-06
0.9999609731610163 0.9999911458582055 32.99861157362915
As you can see, it did converge.
There are no issues in your code other than the missing normalization.
Now you can plug your random error back in and find the best possible estimates.
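For example (a small sketch, using the same uniform error as in your question, after X1 and X2 have been normalized as above):

import random

e = np.array([random.uniform(0, 0.5) for i in range(5000)])
y = X1 + X2 + e
# the same training loop then converges to w1 ≈ 1, w2 ≈ 1 and b close to the mean of e (about 0.25)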
Okay, there are a few problems with the problem formulation.
Scaling: gradient descent generally needs the variables to be scaled well so that alpha can be set properly. Everything is relative in most cases, and you can always multiply a problem by a fixed constant; however, because the weights are manipulated directly by the alpha value, very high or very low weight values are harder to reach. I am therefore scaling your setup down by about 10,000 and also reducing the random error to the same scale:
import numpy as np
import random
X1 = np.random.random(5000)
X2 = np.random.random(5000)
e = np.array([random.uniform(0, 0.0005) for i in range(5000)])
y = X1 + X2 + e
Dependence of y_pred on b: I am not sure what the value of b is supposed to do, or why you are explicitly introducing an error term into y_pred. Your prediction should assume that there is no error :D
If the Xs and y are scaled well, a few tries with the hyperparameters will yield good values:
for i in range(5):
    y_pred = w1 * X1 + w2 * X2
    L = np.sum(np.square(y - y_pred))/(2 * n)
    dL_dw1 = -(1/n) * np.sum((y - y_pred) * X1)
    dL_dw2 = -(1/n) * np.sum((y - y_pred) * X2)
    dL_db = -(1/n) * np.sum((y - y_pred))
    w1 = w1 - alpha * dL_dw1
    w2 = w2 - alpha * dL_dw2
    print(L, w1, w2)
You can play around with those values, but they will converge. For example, with:
w1, w2, b = 1.1, 0.9, 0.01
alpha = 1
0.0008532534726479387 1.0911950693892498 0.9082610891021278
0.0007137567968828647 1.0833134985852988 0.9159869797801239
0.0005971536415151483 1.0761750602775175 0.9231234590515701
0.0004996145120126794 1.0696746682185534 0.9296797694772246
0.0004180103133293466 1.0637407602096771 0.9356885401106588
I'm trying to figure out just one line in the snippet below, which I got from here:
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim = (0.5,4)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in xrange(60000):
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
The line I cannot figure out is:
layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
Specifically, why are we doing the dot product with synapse_1 instead of layer_1?
By using synapse_1 in the delta calculation, the partial differentiation is carried out with respect to the weights instead of the layer_1 output, which is what we want, right?
I think this is what layer_1_delta should actually be:
layer_1_delta = layer_1.T.dot(layer_2_delta) * (layer_1 * (1-layer_1))
I have made a small neural network that takes two inputs x, y and outputs z, a 1 or 0. There is one hidden layer with two neurons, h1 and h2, and one neuron in the output layer. The inputs are height and width, integers between 0 and 100, and the classes are big and small (e.g. 10, 8 is 'small' and 76, 92 is 'big'). There are linear and non-linear data types. I have used the sigmoid activation function and am back-propagating with partial differentiation with respect to the weights and biases. I'm not using any ML libraries; I'm trying to code in most of the maths directly. I cannot get it to work. Perhaps I have made a mistake in the backpropagation algorithm, as this was the most challenging part. I am hoping someone can point out what I've done wrong. Below is the code:
import random, numpy, math

lr = 1           # learning rate
dt = '1'         # data type
epochs = 100000
tda = 50         # training data amount

def step(x):     # step function
    if x > 0:
        x = 1
    else:
        x = 0
    return x

def error(truth, output):
    return 0.5 * (truth - output)**2

def sig(x):      # sigmoid activation
    return 1/(1+numpy.exp(-x))

# weights
w = [random.random(), random.random(), random.random(), random.random(), random.random(), random.random()]
# biases
b = [random.random(), random.random(), random.random()]

def Net(x, y, t):  # t is truth (or target)
    h1 = x*w[0]+y*w[1]+b[0]   # summation in h1, first neuron in hidden layer
    h1out = sig(h1)           # sigmoid activation
    h2 = x*w[2]+y*w[3]+b[1]
    h2out = sig(h2)
    z = h1out*w[4]+h2out*w[5]+b[2]  # z is the output neuron
    zout = sig(z)
    e = error(t, zout)        # e is the error

    # backpropagation: partial differentiations to find the error at each weight and bias
    e5 = (zout-t) * (zout * (1 - zout)) * h1out   # e5 is the error at the 5th weight etc.
    e6 = (zout-t) * (zout * (1 - zout)) * h2out
    e1 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out)) * x
    e2 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out)) * y
    e3 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out)) * x
    e4 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out)) * y
    be3 = (zout-t) * (zout * (1 - zout))          # error at the 3rd bias etc.
    be1 = (zout-t) * (zout * (1 - zout)) * w[4] * (h1out * (1 - h1out))
    be2 = (zout-t) * (zout * (1 - zout)) * w[5] * (h2out * (1 - h2out))
    #print(e1, e2, e3, e4, e5, e6, be1, be2, be3)

    # updating weights and biases
    w[0] = w[0] - (e1 * lr)
    w[1] = w[1] - (e2 * lr)
    w[2] = w[2] - (e3 * lr)
    w[3] = w[3] - (e4 * lr)
    w[4] = w[4] - (e5 * lr)
    w[5] = w[5] - (e6 * lr)
    b[2] = b[2] - (be3 * lr)
    b[0] = b[0] - (be1 * lr)
    b[1] = b[1] - (be2 * lr)
Your backpropagation should look like the sketch below, and that continues to less deep layers. You basically have to calculate the error of every neuron through the derivative of its activation function, and use it for the error of the previous layers.
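In your notation this is roughly (a sketch; delta_z, delta_h1 and delta_h2 are just illustrative names):

# error (delta) at the output neuron: derivative of the loss times the derivative of the sigmoid
delta_z = (zout - t) * zout * (1 - zout)

# errors at the hidden neurons: the output delta flows back through the output weights
# and through the derivative of each hidden neuron's activation
delta_h1 = delta_z * w[4] * h1out * (1 - h1out)
delta_h2 = delta_z * w[5] * h2out * (1 - h2out)

# the gradient for a weight is the delta of the neuron it feeds, times the input it multiplies,
# e.g. for w[4] (h1 -> output) and w[0] (x -> h1):
w[4] = w[4] - lr * delta_z * h1out
w[0] = w[0] - lr * delta_h1 * x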
I'm currently writing my own code to implement a single-hidden-layer neural network and testing the model on the MNIST dataset. But I get a weird result (the NLL is unacceptably high), even though I have checked my code for over 2 days without finding what went wrong.
Here are the global parameters:
layers = np.array([784, 300, 10])
learningRate = 0.01
momentum = 0.01
batch_size = 10000
num_of_batch = len(train_label)//batch_size
nepoch = 30
Softmax function definition:
def softmax(x):
    x = np.exp(x)
    x_sum = np.sum(x, axis=1)  # shape = (nsamples,)
    for row_idx in range(len(x)):
        x[row_idx, :] /= x_sum[row_idx]
    return x
Sigmoid function definition:
def f(x):
    return 1.0/(1+np.exp(-x))
Initialization of w and b:
k = np.vectorize(math.sqrt)(layers[0:-2]*layers[1:])
w1 = np.random.uniform(-0.5, 0.5, layers[0:2][::-1])
b1 = np.random.uniform(-0.5, 0.5, (1,layers[1]))
w2 = np.random.uniform(-0.5, 0.5, layers[1:3][::-1])
b2 = np.random.uniform(-0.5, 0.5, (1,layers[2]))
And the following is the core part for each mini-batch:
for idx in range(num_of_batch):
    # forward, vectorized
    x = train_set[idx*batch_size:(idx+1)*batch_size, :]
    y = Y[idx*batch_size:(idx+1)*batch_size, :]
    a1 = x
    a2 = f(np.dot(np.insert(a1, 0, 1, axis=1), np.insert(w1, 0, b1, axis=1).T))
    a3 = softmax(np.dot(np.insert(a2, 0, 1, axis=1), np.insert(w2, 0, b2, axis=1).T))

    # compute deltas
    d3 = a3 - y
    d2 = np.dot(d3, w2)*a2*(1.0-a2)

    # compute gradients
    D2 = np.dot(d3.T, a2)
    D1 = np.dot(d2.T, a1)

    # update parameters
    w1 = w1 - learningRate*(D1/batch_size + momentum*w1)
    b1 = b1 - learningRate*(np.sum(d2, axis=0)/batch_size)
    w2 = w2 - learningRate*(D2/batch_size + momentum*w2)
    b2 = b2 - learningRate*(np.sum(d3, axis=0)/batch_size)

    e = -np.sum(y*np.log(a3))/batch_size
    err.append(e)
After one epoch (50,000 samples), I got the following sequence of e, which seems far too large:
Out[1]:
10000/50000 4.033538
20000/50000 3.924567
30000/50000 3.761105
40000/50000 3.632708
50000/50000 3.549212
I think the backprop code should be correct, but I can't find what's going wrong. It has tortured me for over two days.