Weights in Numpy Neural Net Not Updating, Error is Static - python

I'm trying to build a neural network on the Mnist dataset for a HW assignment. I'm not asking anyone to DO the assignment for me, I'm just having trouble figuring out why the Training accuracy and Test Accuracy seem to be static for every epoch?
It's as if my way of updating weights is not working.
Epoch: 0, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 1, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 2, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 3, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
.
.
.
However, when I run the actual forward and backprop lines in a loop without any 'fluff' of classes or methods the cost goes down. I just can't seem to get it working in the current class setup.
I've tried building my own methods that pass the weights and biases between the backprop and feed-forward methods explicitly, however, those changes haven't done anything to fix this gradient descent issue.
I'm pretty sure it has to do with the definition of the backprop method in the NeuralNetwork class below. I've been struggling to find a way to update the weights by accessing the weight and bias variables in the main training loop.
def backward(self, Y_hat, Y):
'''
Backward pass through network. Update parameters
INPUT
Y_hat: Network predicted
shape: (?, 10)
Y: Correct target
shape: (?, 10)
RETURN
cost: calculate J for errors
type: (float)
'''
#Naked Backprop
dJ_dZ2 = Y_hat - Y
dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
dJ_db2 = Y_hat - Y
dJ_dX2 = np.matmul(dJ_db2, np.transpose(NeuralNetwork.W2))
dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
inner_mat = np.matmul(Y-Y_hat,np.transpose(NeuralNetwork.W2))
dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
dJ_db1 = np.matmul(Y - Y_hat, np.transpose(NeuralNetwork.W2)) * d_sigmoid(Z1)
lr = 0.1
# weight updates here
#just line 'em up and do lr * the dJ_.. vars you found above
NeuralNetwork.W2 = NeuralNetwork.W2 - lr * dJ_dW2
NeuralNetwork.b2 = NeuralNetwork.b2 - lr * dJ_db2
NeuralNetwork.W1 = NeuralNetwork.W1 - lr * dJ_dW1
NeuralNetwork.b1 = NeuralNetwork.b1 - lr * dJ_db1
# calculate the cost
cost = -1 * np.sum(Y * np.log(Y_hat))
# calc gradients
# weight updates
return cost#, W1, W2, b1, b2
I'm really at a loss here, any help is appreciated!
Full code is shown here...
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
np.random.seed(0)
"""### Load MNIST Dataset"""
(x_train, y_train), (x_test, y_test) = mnist.load_data()
X = x_train[0].reshape(1,-1)/255.; Y = y_train[0]
zeros = np.zeros(10); zeros[Y] = 1
Y = zeros
#Here we implement the forward pass for the network using the single example, $X$, from above
### Initialize weights and Biases
num_hidden_nodes = 200
num_classes = 10
# init weights
#first set of weights (these are what the input matrix is multiplied by)
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
#this is the first bias layer and i think it's a 200 dimensional vector of the biases that go into each neuron before the sigmoid function.
b1 = np.zeros((1,num_hidden_nodes))
#again this are the weights for the 2nd layer that are multiplied by the activation output of the 1st layer
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
#these are the biases that are added to each neuron before the final softmax activation.
b2 = np.zeros((1,num_classes))
# multiply input with weights
Z1 = np.add(np.matmul(X,W1), b1)
def sigmoid(z):
return 1 / (1 + np.exp(- z))
def d_sigmoid(g):
return sigmoid(g) * (1. - sigmoid(g))
# activation function of Z1
X2 = sigmoid(Z1)
Z2 = np.add(np.matmul(X2,W2), b2)
# softmax
def softmax(z):
# subracting the max adds numerical stability
shiftx = z - np.max(z)
exps = np.exp(shiftx)
return exps / np.sum(exps)
def d_softmax(Y_hat, Y):
return Y_hat - Y
# the hypothesis,
Y_hat = softmax(Z2)
"""Initially the network guesses all categories equally. As we perform backprop the network will get better at discerning images and their categories."""
"""### Calculate Cost"""
cost = -1 * np.sum(Y * np.log(Y_hat))
#so i think the main thing here is like a nested chain rule thing, where we find the change in the cost with respec to each
# set of matrix weights and biases?
#here is probably the order of how we do things based on whats in math below...
'''
1. find the partial deriv of the cost function with respect to the output of the second layer, without the softmax it looks like for some reason?
2. find the partial deriv of the cost function with respect to the weights of the second layer, which is dope cause we can re-use the partial deriv from step 1
3. this one I know intuitively we're looking for the parial deriv of cost with respect to the bias term of the second layer, but how TF does that math translate into
numpy? is that the same y_hat - Y from the first step? where is there anyother Y_hat - y?
4. This is also confusing cause I know where to get the weights for layer 2 from and how to transpose them, but again, where is the Y_hat - Y?
5. Here we take the missing partial deriv from step 4 and multiply it by the d_sigmoid function of the first layer outputs before activations.
6. In this step we multiply the first layer weights (transposed) by the var from 5
7. And this is weird too, this just seems like the same step as number 5 repeated for some reason but with y-y_hat instead of y_hat-y
'''
#look at tutorials like this https://www.youtube.com/watch?v=7qYtIveJ6hU
#I think the most backprop layer steps are fine without biases but how do we find the bias derivatives
#maybe just the hypothesis matrix minus the actual y matrix?
dJ_dZ2 = Y_hat - Y
#find partial deriv of cost w respect to 2nd layer weights
dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
#finding the partial deriv of cost with respect to the 2nd layer biases
#I'm still not 100% sure why this is here and why it works out to Y_hat - Y
dJ_db2 = Y_hat - Y
#finding the partial deriv of cost with respect to 2nd layer inputs
dJ_dX2 = np.matmul(dJ_db2, np.transpose(W2))
#finding the partial deriv of cost with respect to Activation of layer 1
dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
#y-yhat matmul 2nd layer weights
#I added the transpose to the W2 var because the matrices were not compaible sizes without it
inner_mat = np.matmul(Y-Y_hat,np.transpose(W2))
dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
class NeuralNetwork:
# set learning rate
lr = 0.01
# init weights
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
b1 = np.zeros((1,num_hidden_nodes))
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
b2 = np.zeros((1,num_classes))
def __init__(self, num_hidden_nodes, num_classes, lr=0.01):
'''
# set learning rate
lr = lr
# init weights
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
b1 = np.zeros((1,num_hidden_nodes))
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
b2 = np.zeros((1,num_classes))
'''
def forward(self, X1):
'''
Forward pass through the network
INPUT
X: input to network
shape: (?, 784)
RETURN
Y_hat: prediction from output of network
shape: (?, 10)
'''
Z1 = np.add(np.matmul(X,W1), b1)
X2 = sigmoid(Z1)# activation function of Z1
Z2 = np.add(np.matmul(X2,W2), b2)
Y_hat = softmax(Z2)
#return the hypothesis
return Y_hat
# store input for backward pass
# you can basically copy and past what you did in the forward pass above here
# think about what you need to store for the backward pass
return
def backward(self, Y_hat, Y):
'''
Backward pass through network. Update parameters
INPUT
Y_hat: Network predicted
shape: (?, 10)
Y: Correct target
shape: (?, 10)
RETURN
cost: calculate J for errors
type: (float)
'''
#Naked Backprop
dJ_dZ2 = Y_hat - Y
dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
dJ_db2 = Y_hat - Y
dJ_dX2 = np.matmul(dJ_db2, np.transpose(NeuralNetwork.W2))
dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
inner_mat = np.matmul(Y-Y_hat,np.transpose(NeuralNetwork.W2))
dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
dJ_db1 = np.matmul(Y - Y_hat, np.transpose(NeuralNetwork.W2)) * d_sigmoid(Z1)
lr = 0.1
# weight updates here
#just line 'em up and do lr * the dJ_.. vars you found above
NeuralNetwork.W2 = NeuralNetwork.W2 - lr * dJ_dW2
NeuralNetwork.b2 = NeuralNetwork.b2 - lr * dJ_db2
NeuralNetwork.W1 = NeuralNetwork.W1 - lr * dJ_dW1
NeuralNetwork.b1 = NeuralNetwork.b1 - lr * dJ_db1
# calculate the cost
cost = -1 * np.sum(Y * np.log(Y_hat))
# calc gradients
# weight updates
return cost#, W1, W2, b1, b2
nn = NeuralNetwork(200,10,lr=.01)
num_train = float(len(x_train))
num_test = float(len(x_test))
for epoch in range(10):
train_correct = 0; train_cost = 0
# training loop
for i in range(len(x_train)):
x = x_train[i]; y = y_train[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass through network
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was accurate
if pred_num == y:
train_correct += 1
# make a one hot categorical vector; same as keras.utils.to_categorical()
zeros = np.zeros(10); zeros[y] = 1
Y = zeros
# compute gradients and update weights
train_cost += nn.backward(Y_hat, Y)
test_correct = 0
# validation loop
for i in range(len(x_test)):
x = x_test[i]; y = y_test[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was correct
if pred_num == y:
test_correct += 1
# no backward pass here!
# compute average metrics for train and test
train_correct = round(100*(train_correct/num_train), 2)
test_correct = round(100*(test_correct/num_test ), 2)
train_cost = round( train_cost/num_train, 2)
# print status message every epoch
log_message = 'Epoch: {epoch}, Train Accuracy: {train_acc}%, Train Cost: {train_cost}, Test Accuracy: {test_acc}%'.format(
epoch=epoch,
train_acc=train_correct,
train_cost=train_cost,
test_acc=test_correct
)
print (log_message)
also, The project is in this colab & ipynb notebook

I believe this is pretty clear, in this part of your loop:
for epoch in range(10):
train_correct = 0; train_cost = 0
# training loop
for i in range(len(x_train)):
x = x_train[i]; y = y_train[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass through network
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was accurate
if pred_num == y:
train_correct += 1
# make a one hot categorical vector; same as keras.utils.to_categorical()
zeros = np.zeros(10); zeros[y] = 1
Y = zeros
# compute gradients and update weights
train_cost += nn.backward(Y_hat, Y)
test_correct = 0
# validation loop
for i in range(len(x_test)):
x = x_test[i]; y = y_test[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was correct
if pred_num == y:
test_correct += 1
# no backward pass here!
# compute average metrics for train and test
train_correct = round(100*(train_correct/num_train), 2)
test_correct = round(100*(test_correct/num_test ), 2)
train_cost = round( train_cost/num_train, 2)
# print status message every epoch
log_message = 'Epoch: {epoch}, Train Accuracy: {train_acc}%, Train Cost: {train_cost}, Test Accuracy: {test_acc}%'.format(
epoch=epoch,
train_acc=train_correct,
train_cost=train_cost,
test_acc=test_correct
)
print (log_message)
For every epoch of the 10 epochs in your loop, you are setting your train_correct and train_cost to 0, hence there is no updating after each epoch

Related

If-Else Statement in Custom Training Loop in Tensorflow

I created a model class which is a subclass of keras.Model. While training the model, I want to change the weights of the loss functions after some epochs. In order to do that I created boolean variables to my model indicating that the model should start training with additional loss function. I add a pseudo code that mainly shows what I am trying to achieve.
class MyModel(keras.Model):
self.start_loss_2 = False
def train_step(self):
# Check if training with loss_2 started
weight_loss_2 = 0.0
if self.start_loss_2:
weight_loss_2 = 0.5
# Pass the data through model
# Calculate two loss values
total_loss = loss_1 + weight_loss_2 * loss_2
# Calculate gradients with tf.Tape
# Update variables
# This is called via Callback after each epoch
def epoch_finised(epoch_num):
if epoch_num > START_LOSS_2:
self.start_loss_2 = True
My questions is:
Is it valid to use if-else statement whose value changes after some time? If it is not, how can achieve this?
Yes. You can create a tf.Variable and then assign a new value to it based on some training criteria.
Example:
import numpy as np
import tensorflow as tf
# simple toy network
x_in = tf.keras.Input((10))
x = tf.keras.layers.Dense(25)(x_in)
x_out = tf.keras.layers.Dense(1)(x)
# model
m = tf.keras.Model(x_in, x_out)
# fake data
X = tf.random.normal((100, 10))
y0 = tf.random.normal((100, ))
y1 = tf.random.normal((100, ))
# optimizer
m_opt = tf.keras.optimizers.Adam(1e-2)
# prep data
ds = tf.data.Dataset.from_tensor_slices((X, y0, y1))
ds = ds.repeat().batch(5)
train_iter = iter(ds)
# toy loss function that uses a weight
def loss_fn(y_true0, y_true1, y_pred, weight):
mse = tf.keras.losses.MSE
mse_0 = tf.math.reduce_mean(mse(y_true0, y_pred))
mse_1 = tf.math.reduce_mean(mse(y_true1, y_pred))
return mse_0 + weight * mse_1
NUM_EPOCHS = 4
NUM_BATCHES_PER_EPOCH = 10
START_NEW_LOSS_AT_GLOBAL_STEP = 20
# the weight variable set to 0 initially and then
# will be changed after a certain number of steps
# (or some other training criteria)
w = tf.Variable(0.0, trainable=False)
for epoch in range(NUM_EPOCHS):
losses = []
for batch in range(NUM_BATCHES_PER_EPOCH):
X_train, y0_train, y1_train = next(train_iter)
with tf.GradientTape() as tape:
y_hat = m(X_train)
loss = loss_fn(y0_train, y1_train, y_hat, w)
losses.append(loss)
m_vars = m.trainable_variables
m_grads = tape.gradient(loss, m_vars)
m_opt.apply_gradients(zip(m_grads, m_vars))
print(f"epoch: {epoch}\tloss: {np.mean(losses):.4f}")
losses = []
# if the criteria is met assign a huge number to see if the
# loss spikes up
if (epoch + 1) * (batch + 1) >= START_NEW_LOSS_AT_GLOBAL_STEP:
w.assign(10000.0)
# epoch: 0 loss: 1.8226
# epoch: 1 loss: 1.1143
# epoch: 2 loss: 8788.2227 <= looks like assign worked
# epoch: 3 loss: 10999.5449

Problem with implementation of Multilayer perceptron

I am trying to create a multi-layered perceptron for the purpose of classifying a dataset of hand drawn digits obtained from the MNIST database. It implements 2 hidden layers that have a sigmoid activation function while the output layer utilizes SoftMax. However, for whatever reason I am not able to get it to work. I have attached the training loop from my code below, this I am confident is where the problems stems from. Can anyone identify possible issues with my implementation of the perceptron?
def train(self, inputs, targets, eta, niterations):
"""
inputs is a numpy array of shape (num_train, D) containing the training images
consisting of num_train samples each of dimension D.
targets is a numpy array of shape (num_train, D) containing the training labels
consisting of num_train samples each of dimension D.
eta is the learning rate for optimization
niterations is the number of iterations for updating the weights
"""
ndata = np.shape(inputs)[0] # number of data samples
# adding the bias
inputs = np.concatenate((inputs, -np.ones((ndata, 1))), axis=1)
# numpy array to store the update weights
updatew1 = np.zeros((np.shape(self.weights1)))
updatew2 = np.zeros((np.shape(self.weights2)))
updatew3 = np.zeros((np.shape(self.weights3)))
for n in range(niterations):
# forward phase
self.outputs = self.forwardPass(inputs)
# Error using the sum-of-squares error function
error = 0.5*np.sum((self.outputs-targets)**2)
if (np.mod(n, 100) == 0):
print("Iteration: ", n, " Error: ", error)
# backward phase
deltao = self.outputs - targets
placeholder = np.zeros(np.shape(self.outputs))
for j in range(np.shape(self.outputs)[1]):
y = self.outputs[:, j]
placeholder[:, j] = y * (1 - y)
for y in range(np.shape(self.outputs)[1]):
if not y == j:
placeholder[:, j] += -y * self.outputs[:, y]
deltao *= placeholder
# compute the derivative of the second hidden layer
deltah2 = np.dot(deltao, np.transpose(self.weights3))
deltah2 = self.hidden2*self.beta*(1.0-self.hidden2)*deltah2
# compute the derivative of the first hidden layer
deltah1 = np.dot(deltah2[:, :-1], np.transpose(self.weights2))
deltah1 = self.hidden1*self.beta*(1.0-self.hidden1)*deltah1
# update the weights of the three layers: self.weights1, self.weights2 and self.weights3
updatew1 = eta*(np.dot(np.transpose(inputs),deltah1[:, :-1])) + (self.momentum * updatew1)
updatew2 = eta*(np.dot(np.transpose(self.hidden1),deltah2[:, :-1])) + (self.momentum * updatew2)
updatew3 = eta*(np.dot(np.transpose(self.hidden2),deltao)) + (self.momentum * updatew3)
self.weights1 -= updatew1
self.weights2 -= updatew2
self.weights3 -= updatew3
def forwardPass(self, inputs):
"""
inputs is a numpy array of shape (num_train, D) containing the training images
consisting of num_train samples each of dimension D.
"""
# layer 1
# the forward pass on the first hidden layer with the sigmoid function
self.hidden1 = np.dot(inputs, self.weights1)
self.hidden1 = 1.0/(1.0+np.exp(-self.beta*self.hidden1))
self.hidden1 = np.concatenate((self.hidden1, -np.ones((np.shape(self.hidden1)[0], 1))), axis=1)
# layer 2
# the forward pass on the second hidden layer with the sigmoid function
self.hidden2 = np.dot(self.hidden1, self.weights2)
self.hidden2 = 1.0/(1.0+np.exp(-self.beta*self.hidden2))
self.hidden2 = np.concatenate((self.hidden2, -np.ones((np.shape(self.hidden2)[0], 1))), axis=1)
# output layer
# the forward pass on the output layer with softmax function
outputs = np.dot(self.hidden2, self.weights3)
outputs = np.exp(outputs)
outputs /= np.repeat(np.sum(outputs, axis=1),outputs.shape[1], axis=0).reshape(outputs.shape)
return outputs
Update: I have since figured something out that I messed up during the backpropagation of the SoftMax algorithm. The actual deltao should be:
deltao = self.outputs - targets
placeholder = np.zeros(np.shape(self.outputs))
for j in range(np.shape(self.outputs)[1]):
y = self.outputs[:, j]
placeholder[:, j] = y * (1 - y)
# the counter for the for loop below used to also be named y causing confusion
for i in range(np.shape(self.outputs)[1]):
if not i == j:
placeholder[:, j] += -y * self.outputs[:, i]
deltao *= placeholder
After this correction the overflow errors have seemed to have sorted themselves however, there is now a new problem, no matter my efforts the accuracy of the perceptron does not exceed 15% no matter what variables I change
Second Update: After a long time I have finally found a way to get my code to work. I had to change the backpropogation of SoftMax (in code this is called deltao) to the following:
deltao = np.exp(self.outputs)
deltao/=np.repeat(np.sum(deltao,axis=1),deltao.shape[1]).reshape(deltao.shape)
deltao = deltao * (1 - deltao)
deltao *= (self.outputs - targets)/np.shape(inputs)[0]
Only problem is I have no idea why this works as a derivative of SoftMax could anyone explain this?

How to update the learning rate in a two layered multi-layered perceptron?

Given the XOR problem:
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
And a simple
two layered Multi-Layered Perceptron (MLP) with
sigmoid activations between them and
Mean Square Error (MSE) as the loss function/optimization criterion
[code]:
def sigmoid(x): # Returns values that sums to one.
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(sx): # For backpropagation.
# See https://math.stackexchange.com/a/1225116
return sx * (1 - sx)
# Cost functions.
def mse(predicted, truth):
return np.sum(np.square(truth - predicted))
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))
# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))
And given the stopping criteria as a fixed no. of epochs (no. of iterations through the X and Y) with a fixed learning rate of 0.3:
# Initialize weigh
num_epochs = 10000
learning_rate = 0.3
When I run through the forward-backward propagation and update the weights in each epoch, how should I update the weights?
I tried to simply add the product of the learning rate with the dot product of the backpropagated derivative with the layer outputs but the model still only updated the weights in one direction causing all the weights to degrade to near zero.
for epoch_n in range(num_epochs):
layer0 = X
# Forward propagation.
# Inside the perceptron, Step 2.
layer1 = sigmoid(np.dot(layer0, W1))
layer2 = sigmoid(np.dot(layer1, W2))
# Back propagation (Y -> layer2)
# How much did we miss in the predictions?
layer2_error = mse(layer2, Y)
#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_delta = layer2_error * sigmoid_derivative(layer2)
# Back propagation (layer2 -> layer1)
# How much did each layer1 value contribute to the layer2 error (according to the weights)?
layer1_error = np.dot(layer2_delta, W2.T)
layer1_delta = layer1_error * sigmoid_derivative(layer1)
# update weights
W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
#print(np.dot(layer0.T, layer1_delta))
#print(epoch_n, list((layer2)))
# Log the loss value as we proceed through the epochs.
losses.append(layer2_error.mean())
How should the weights be updated correctly?
Full code:
from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
def sigmoid(x): # Returns values that sums to one.
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(sx):
# See https://math.stackexchange.com/a/1225116
return sx * (1 - sx)
# Cost functions.
def mse(predicted, truth):
return np.sum(np.square(truth - predicted))
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T
# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Lets set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))
# Define the shape of the output vector.
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))
# Initialize weigh
num_epochs = 10000
learning_rate = 0.3
losses = []
for epoch_n in range(num_epochs):
layer0 = X
# Forward propagation.
# Inside the perceptron, Step 2.
layer1 = sigmoid(np.dot(layer0, W1))
layer2 = sigmoid(np.dot(layer1, W2))
# Back propagation (Y -> layer2)
# How much did we miss in the predictions?
layer2_error = mse(layer2, Y)
#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_delta = layer2_error * sigmoid_derivative(layer2)
# Back propagation (layer2 -> layer1)
# How much did each layer1 value contribute to the layer2 error (according to the weights)?
layer1_error = np.dot(layer2_delta, W2.T)
layer1_delta = layer1_error * sigmoid_derivative(layer1)
# update weights
W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
#print(np.dot(layer0.T, layer1_delta))
#print(epoch_n, list((layer2)))
# Log the loss value as we proceed through the epochs.
losses.append(layer2_error.mean())
# Visualize the losses
plt.plot(losses)
plt.show()
Am I missing anything in the backpropagation?
Maybe I missed out the derivative from the cost to the second layer?
Edited
I realized I missed the partial derivative from the cost to the second layer and after adding it:
# Cost functions.
def mse(predicted, truth):
return 0.5 * np.sum(np.square(predicted - truth)).mean()
def mse_derivative(predicted, truth):
return predicted - truth
With the updated backpropagation loop across epochs:
for epoch_n in range(num_epochs):
layer0 = X
# Forward propagation.
# Inside the perceptron, Step 2.
layer1 = sigmoid(np.dot(layer0, W1))
layer2 = sigmoid(np.dot(layer1, W2))
# Back propagation (Y -> layer2)
# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)
#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = np.dot(cost_delta, cost_error)
layer2_delta = layer2_error * sigmoid_derivative(layer2)
# Back propagation (layer2 -> layer1)
# How much did each layer1 value contribute to the layer2 error (according to the weights)?
layer1_error = np.dot(layer2_delta, W2.T)
layer1_delta = layer1_error * sigmoid_derivative(layer1)
# update weights
W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
It seemed to train and learn the XOR...
But now the question begets, is the layer2_error and layer2_delta computed correctly, i.e. is the following part of the code correct?
# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)
#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = np.dot(cost_delta, cost_error)
layer2_delta = layer2_error * sigmoid_derivative(layer2)
Is it correct to do a dot product on the cost_delta and cost_error for the layer2_error? Or would layer2_error just be equals to cost_delta?
I.e.
# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)
#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = cost_delta
layer2_delta = layer2_error * sigmoid_derivative(layer2)
Yes, it is correct to multiply the residuals (cost_error) with the delta values when we update the weights.
However, it doesn't really matter whether do dot product or not since it cost_error is a scalar. So, a simple multiplication is enough. But, we definitely have to multiply the gradient of the cost function because that's where we start our backprop (i.e. it's the entry point for backward pass).
Also, the below function can be simplified:
def mse(predicted, truth):
return 0.5 * np.sum(np.square(predicted - truth)).mean()
as
def mse(predicted, truth):
return 0.5 * np.mean(np.square(predicted - truth))

What's wrong with my backpropagation?

I'm trying to code a neural network from scratch in python. To check whether everything works I wanted to overfit the network but the loss seems to explode at first and then comes back to the initial value and stops there (Doesn't converge). I've checked my code and could find the reason. I assume my understanding or implementation of backpropagation is incorrect but there might be some other reason. Can anyone help me out or at least point me in the right direction?
# Initialize weights and biases given dimesnsions (For this example the dimensions are set to [12288, 64, 1])
def initialize_parameters(dims):
# Initiate parameters
parameters = {}
L = len(dims) # Number of layers in the network
# Loop over the given dimensions. Initialize random weights and set biases to zero.
for i in range(1, L):
parameters["W" + str(i)] = np.random.randn(dims[i], dims[i-1]) * 0.01
parameters["b" + str(i)] = np.zeros([dims[i], 1])
return parameters
# Activation Functions
def relu(x, deriv=False):
if deriv:
return 1. * (x > 0)
else:
return np.maximum(0,x)
def sigmoid(x, deriv=False):
if deriv:
return x * (1-x)
else:
return 1/(1 + np.exp(-x))
# Forward and backward pass for 2 layer neural network. (1st relu, 2nd sigmoid)
def forward_backward(X, Y, parameters):
# Array for storing gradients
grads = {}
# Get the length of examples
m = Y.shape[1]
# First layer
Z1 = np.dot(parameters["W1"], X) + parameters["b1"]
A1 = relu(Z1)
# Second layer
Z2 = np.dot(parameters["W2"], A1) + parameters["b2"]
AL = sigmoid(Z2)
# Compute cost
cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
# Backpropagation
# Second Layer
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ2 = dAL * sigmoid(AL, deriv=True)
grads["dW2"] = np.dot(dZ2, A1.T) / m
grads["db2"] = np.sum(dZ2, axis=1, keepdims=True) / m
# First layer
dA1 = np.dot(parameters["W2"].T, dZ2)
dZ1 = dA1 * relu(A1, deriv=True)
grads["dW1"] = np.dot(dZ1, X.T)
grads["db1"] = np.sum(dZ1, axis=1, keepdims=True) / m
return AL, grads, cost
# Hyperparameters
dims = [12288, 64, 1]
epoches = 2000
learning_rate = 0.1
# Initialize parameters
parameters = initialize_parameters(dims)
log_list = []
# Train the network
for i in range(epoches):
# Get X and Y
x = np.array(train[0:10],ndmin=2).T
y = np.array(labels[0:10], ndmin=2).T
# Perform forward and backward pass
AL, grads, cost = forward_backward(x, y, parameters)
# Compute cost and append to the log_list
log_list.append(cost)
# Update parameters with computed gradients
parameters = update_parameters(grads, parameters, learning_rate)
plt.plot(log_list)
plt.title("Loss of the network")
plt.show()
I am struggling to find the place where you calculate the error gradients and the input training data sample would also help...
I don't know if this will help you, but I'll share my solution for Python neural network to learn XOR problem.
import numpy as np
def sigmoid_function(x, derivative=False):
"""
Sigmoid function
“x” is the input and “y” the output, the nonlinear properties of this function means that
the rate of change is slower at the extremes and faster in the centre. Put plainly,
we want the neuron to “make its mind up” instead of indecisively staying in the middle.
:param x: Float
:param Derivative: Boolean
:return: Float
"""
if (derivative):
return x * (1 - x) # Derivative using the chain rule.
else:
return 1 / (1 + np.exp(-x))
# create dataset for XOR problem
input_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ideal_output = np.array([[0.0], [1.0], [1.0], [0.0]])
#initialize variables
learning_rate = 0.1
epoch = 50000 #number or iterations basically - One round of forward and back propagation is called an epoch
# get the second element from the numpy array shape field to detect the count of features for input layer
input_layer_neurons = input_data.shape[1]
hidden_layer_neurons = 3 #number of hidden layer neurons
output_layer_neurons = 1 #number of output layer neurons
#init weight & bias
weights_hidden = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bias_hidden = np.random.uniform(1, hidden_layer_neurons)
weights_output = np.random.uniform(size=(hidden_layer_neurons, output_layer_neurons))
bias_output = np.random.uniform(1, output_layer_neurons)
for i in range(epoch):
#forward propagation
hidden_layer_input_temp = np.dot(input_data, weights_hidden) #matrix dot product to adjust for weights in the layer
hidden_layer_input = hidden_layer_input_temp + bias_hidden #adjust for bias
hidden_layer_activations = sigmoid_function(hidden_layer_input) #use the activation function
output_layer_input_temp = np.dot(hidden_layer_activations, weights_output)
output_layer_input = output_layer_input_temp + bias_output
output = sigmoid_function(output_layer_input) #final output
#backpropagation (where adjusting of the weights happens)
error = ideal_output - output #error gradient
if (i % 1000 == 0):
print("Error: {}".format(np.mean(abs(error))))
#use derivatives to compute slope of output and hidden layers
slope_output_layer = sigmoid_function(output, derivative=True)
slope_hidden_layer = sigmoid_function(hidden_layer_activations, derivative=True)
#calculate deltas
delta_output = error * slope_output_layer
error_hidden_layer = delta_output.dot(weights_output.T) #calculates the error at hidden layer
delta_hidden = error_hidden_layer * slope_hidden_layer
#change the weights
weights_output += hidden_layer_activations.T.dot(delta_output) * learning_rate
bias_output += np.sum(delta_output, axis=0, keepdims=True) * learning_rate
weights_hidden += input_data.T.dot(delta_hidden) * learning_rate
bias_hidden += np.sum(delta_hidden, axis=0, keepdims=True) * learning_rate

Calculate optimal input of a neural network with theano, by using gradient descent w.r.t. inputs

I have implemented and trained a neural network with Theano of k binary inputs (0,1), one hidden layer and one unit in the output layer. Once it has been trained I want to obtain inputs that maximizes the output (e.g. x which makes unit of output layer closest to 1). So far I haven't found an implementation of it, so I am trying the following approach:
Train network => obtain trained weights (theta1, theta2)
Define the neural network function with x as input and trained theta1, theta2 as fixed parameters. That is: f(x) = sigmoid( theta1*(sigmoid (theta2*x ))). This function takes x and with given trained weights (theta1, theta2) gives output between 0 and 1.
Apply gradient descent w.r.t. x on the neural network function f(x) and obtain x that maximizes f(x) with theta1 and theta2 given.
For these I have implemented the following code with a toy example (k = 2). Based on the tutorial on http://outlace.com/Beginner-Tutorial-Theano/ but changed vector y, so that there is only one combination of inputs that gives f(x) ~ 1 which is x = [0, 1].
Edit1: As suggested optimizer was set to None and bias unit was fixed to 1.
Step 1: Train neural network. This runs well and with out error.
import os
os.environ["THEANO_FLAGS"] = "optimizer=None"
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
x = T.dvector()
y = T.dscalar()
def layer(x, w):
b = np.array([1], dtype=theano.config.floatX)
new_x = T.concatenate([x, b])
m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
h = nnet.sigmoid(m)
return h
def grad_desc(cost, theta):
alpha = 0.1 #learning rate
return theta - (alpha * T.grad(cost, wrt=theta))
in_units = 2
hid_units = 3
out_units = 1
theta1 = theano.shared(np.array(np.random.rand(in_units + 1, hid_units), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(hid_units + 1, out_units), dtype=theano.config.floatX))
hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
(theta1, grad_desc(fc, theta1)),
(theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 0, 0, 0]) #training data Y
cur_cost = 0
for i in range(5000):
for k in range(len(inputs)):
cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
print(run_forward([0,1]))
Output of run forward for [0,1] is: 0.968905860574.
We can also get values of weights with theta1.get_value() and theta2.get_value()
Step 2: Define neural network function f(x). Trained weights (theta1, theta2) are constant parameters of this function.
Things get a little trickier here because of the bias unit, which is part of he vector of inputs x. To do this I concatenate b and x. But the code now runs well.
b = np.array([[1]], dtype=theano.config.floatX)
#b_sh = theano.shared(np.array([[1]], dtype=theano.config.floatX))
rand_init = np.random.rand(in_units, 1)
rand_init[0] = 1
x_sh = theano.shared(np.array(rand_init, dtype=theano.config.floatX))
th1 = T.dmatrix()
th2 = T.dmatrix()
nn_hid = T.nnet.sigmoid( T.dot(th1, T.concatenate([x_sh, b])) )
nn_predict = T.sum( T.nnet.sigmoid( T.dot(th2, T.concatenate([nn_hid, b]))))
Step 3:
Problem is now in gradient descent as is not limited to values between 0 and 1.
fc2 = (nn_predict - 1)**2
cost3 = theano.function(inputs=[th1, th2], outputs=fc2, updates=[
(x_sh, grad_desc(fc2, x_sh))])
run_forward = theano.function(inputs=[th1, th2], outputs=nn_predict)
cur_cost = 0
for i in range(10000):
cur_cost = cost3(theta1.get_value().T, theta2.get_value().T) #call our Theano-compiled cost function, it will auto update weights
if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
print('Cost: %s' % (cur_cost,))
print x_sh.get_value()
The last iteration prints:
Cost: 0.000220317356533
[[-0.11492753]
[ 1.99729555]]
Furthermore input 1 keeps becoming more negative and input 2 increases, while the optimal solution is [0, 1]. How can this be fixed?
You are adding b=[1] via broadcasting rules as opposed to concatenating it. Also, once you concatenate it, your x_sh has one dimension to many which is why the error occurs at nn_predict and not nn_hid

Categories

Resources