I'm trying to build a simple one-layer neural network (NN) using TensorFlow operations. For various reasons I'm not using the Keras API.
I'm trying to write the code for backpropagation, but I'm not sure my calculations are entirely correct. I have two concerns with my code. One is that I used a tanh activation function in my NN, and I'm not sure my derivative for the activation (g) is correct.
The other is that I don't know how many training samples (m) my network is going to see over the duration of training. Do I need to know m to get the derivative of w, or is there another way to derive it?
def linear_forward(x, y, b, w):
    # linear combination (pre-activation)
    z = tf.tensordot(w, x, axes=1) + b
    # tanh activation
    g = tf.math.tanh(z)
    loss = tf.math.reduce_mean(-tf.losses.mse(y, g))
    return (z, g, w, loss)
def linear_backward(g, y, x):
    # MSE derivative
    dMSE = 2*(y-g)
    # tanh derivative, is it correct?
    dZ = dMSE-g**2
    # weights derivative
    dW = np.dot(dZ, x.T) / m  # not sure what m is going to be?
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return dW, db
# network parameters
learning_rate = .001
weights = tf.random.uniform((9, 9), dtype=tf.float32)
bias = tf.random.uniform([9], dtype=tf.float32)
# network inputs
x = tf.random.uniform([9], dtype=tf.float32)
y = tf.constant([0, .1, .2, .3, .4, .5, .6, .7, .8])
parameters = linear_forward(x, y, bias, weights)
dW, db = linear_backward(parameters[1], y, x)
# update weights and bias
weights = weights - learning_rate * dW
bias = bias - learning_rate * db
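For reference, the derivative of tanh(z) is 1 - tanh(z)**2, so with g = tanh(z) the chain rule gives dZ = dMSE * (1 - g**2) rather than dMSE - g**2; and m is simply the number of samples in whichever batch is being processed, so it is known at each update step even if the total over training is not. A minimal NumPy check of the tanh derivative (my aside, not part of the code above):

import numpy as np

z = np.linspace(-2.0, 2.0, 5)
g = np.tanh(z)
analytic = 1.0 - g**2  # d/dz tanh(z) = 1 - tanh(z)^2
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # agrees to ~1e-10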
I am currently in the process of trying to implement my own neural network from scratch to test my understanding of the method. I thought things were going well, as my network managed to approximate the AND and XOR functions without issue, but it turns out it is having trouble learning to approximate a simple square function.
I have attempted to use a variety of different network configurations, with anywhere from 1 to 3 layers and 1-64 nodes. I have varied the learning rate from 0.1 to 0.00000001, and implemented weight decay, as I thought some regularisation might provide some insight into what went wrong. I have also implemented gradient checking, which is giving me conflicting answers, as the result varies greatly from attempt to attempt, ranging from a dreadful 0.6 difference to a fantastic 1e-10. I am using the leaky ReLU activation function and MSE as my cost function.
Could somebody help me spot what I am missing? Or is this purely down to optimising the hyperparameters?
My code is as follows:
import matplotlib.pyplot as plt
import numpy as np
import Sub_Script as ss
# Create sample data set using X**2
X = np.expand_dims(np.linspace(0, 1, 201), axis=0)
y = X**2
plt.plot(X.T, y.T)
# Hyper-parameters
layer_dims = [1, 64, 1]
learning_rate = 0.000001
iterations = 50000
decay = 0.00000001
num_ex = y.shape[1]
# Initializations
num_layers = len(layer_dims)
weights = [None] + [np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2/layer_dims[l-1]) for l in range(1, num_layers)]
biases = [None] + [np.zeros((layer_dims[l], 1)) for l in range(1, num_layers)]
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)
# Main function: Iteration loop
for iter in range(iterations):
    # Main function: Forward Propagation
    z_values, acts = ss.forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = ss.backward_propagation(weights, biases, num_layers, z_values, acts, y)
    weights, biases = ss.update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex)
    if iter % (1000+1) == 0:
        print('Cost: ', ss.mse(acts[-1], y, weights, decay, num_ex))
# Gradient Checking
dweights, dbiases, dw_approx, db_approx = ss.grad_check(weights, biases, num_layers, X, y, decay, num_ex)
# Visualization
plt.plot(X.T, acts[-1].T)
With Sub_Script.py containing the neural network functions:
import numpy as np
import copy as cp
# Construct sub functions, forward, backward propagation and cost and activation functions
# Leaky ReLU Activation Function
def relu(x):
    return (x > 0) * x + (x < 0) * 0.01*x
# Leaky ReLU activation function gradient
def relu_grad(x):
    return (x > 0) + (x < 0) * 0.01
# MSE cost function (includes the weight-decay term)
def mse(prediction, actual, weights, decay, num_ex):
    return np.sum((actual - prediction) ** 2)/(actual.shape[1]) + (decay/(2*num_ex))*np.sum([np.sum(w) for w in weights[1:]])
# MSE cost function gradient
def mse_grad(prediction, actual):
    return -2 * (actual - prediction)/(actual.shape[1])
# Forward Propagation
def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        acts[layer] = relu(z_values[layer])
    return z_values, acts
# Backward Propagation
def backward_propagation(weights, biases, num_layers, z_values, acts, y):
    dweights = [[None] for i in range(num_layers)]
    dbiases = [[None] for i in range(num_layers)]
    zgrad = mse_grad(acts[-1], y) * relu_grad(z_values[-1])
    dweights[-1] = np.dot(zgrad, acts[-2].T)
    dbiases[-1] = np.sum(zgrad, axis=1, keepdims=True)
    for layer in range(num_layers-2, 0, -1):
        zgrad = np.dot(weights[layer+1].T, zgrad) * relu_grad(z_values[layer])
        dweights[layer] = np.dot(zgrad, acts[layer-1].T)
        dbiases[layer] = np.sum(zgrad, axis=1, keepdims=True)
    return dweights, dbiases
# Update Parameters with Regularization
def update_paras(weights, biases, dweights, dbiases, learning_rate, decay, num_ex):
    weights = [None] + [w - learning_rate*(dw + (decay/num_ex)*w) for w, dw in zip(weights[1:], dweights[1:])]
    biases = [None] + [b - learning_rate*db for b, db in zip(biases[1:], dbiases[1:])]
    return weights, biases
# Gradient Checking
def grad_check(weights, biases, num_layers, X, y, decay, num_ex):
    z_values, acts = forward_propagation(weights, biases, num_layers, X)
    dweights, dbiases = backward_propagation(weights, biases, num_layers, z_values, acts, y)
    epsilon = 1e-7
    dw_approx = cp.deepcopy(weights)
    db_approx = cp.deepcopy(biases)
    for layer in range(1, num_layers):
        height = weights[layer].shape[0]
        width = weights[layer].shape[1]
        for i in range(height):
            for j in range(width):
                w_plus = cp.deepcopy(weights)
                w_plus[layer][i, j] += epsilon
                w_minus = cp.deepcopy(weights)
                w_minus[layer][i, j] -= epsilon
                _, temp_plus = forward_propagation(w_plus, biases, num_layers, X)
                cost_plus = mse(temp_plus[-1], y, w_plus, decay, num_ex)
                _, temp_minus = forward_propagation(w_minus, biases, num_layers, X)
                cost_minus = mse(temp_minus[-1], y, w_minus, decay, num_ex)
                dw_approx[layer][i, j] = (cost_plus - cost_minus)/(2*epsilon)
            b_plus = cp.deepcopy(biases)
            b_plus[layer][i, 0] += epsilon
            b_minus = cp.deepcopy(biases)
            b_minus[layer][i, 0] -= epsilon
            _, temp_plus = forward_propagation(weights, b_plus, num_layers, X)
            cost_plus = mse(temp_plus[-1], y, weights, decay, num_ex)
            _, temp_minus = forward_propagation(weights, b_minus, num_layers, X)
            cost_minus = mse(temp_minus[-1], y, weights, decay, num_ex)
            db_approx[layer][i, 0] = (cost_plus - cost_minus)/(2*epsilon)
    dweights_flat = [dw.flatten() for dw in dweights[1:]]
    dweights_flat = np.concatenate(dweights_flat, axis=None)
    dw_approx_flat = [dw.flatten() for dw in dw_approx[1:]]
    dw_approx_flat = np.concatenate(dw_approx_flat, axis=None)
    dbiases_flat = [db.flatten() for db in dbiases[1:]]
    dbiases_flat = np.concatenate(dbiases_flat, axis=None)
    db_approx_flat = [db.flatten() for db in db_approx[1:]]
    db_approx_flat = np.concatenate(db_approx_flat, axis=None)
    d_paras = np.concatenate([dweights_flat, dbiases_flat], axis=None)
    d_approx_paras = np.concatenate([dw_approx_flat, db_approx_flat], axis=None)
    difference = np.linalg.norm(d_paras - d_approx_paras)/(np.linalg.norm(d_paras) + np.linalg.norm(d_approx_paras))
    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    return dweights, dbiases, dw_approx, db_approx
Edit: Made some corrections to some old comments I had in the code, to avoid confusion
Edit 2: Thanks to @sid_508 for helping me find the main problem with my code! I also want to mention in this edit that I found a mistake in the way I had implemented the weight decay. After making the suggested changes and removing the weight decay entirely for now, the neural network appears to work!
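For what it's worth, one inconsistency visible in the code above (presumably the weight-decay mistake mentioned in Edit 2, though I can't be certain): mse adds (decay/(2*num_ex))*np.sum(w) per layer, while update_paras applies the gradient of an L2 term, (decay/num_ex)*w. A consistent L2 penalty would square the weights; a one-line sketch of the corrected term:

reg = (decay / (2 * num_ex)) * np.sum([np.sum(w ** 2) for w in weights[1:]])  # squared weights match the (decay/num_ex)*w gradient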
I ran your code and looked at the output it gave. The issue is that you use ReLU for the final layer too, so you can't get the best fit; use no activation in the final layer and it should produce much better results.
The final-layer activation usually differs from what you use for the hidden layers, and it depends on what type of output you are going for: use linear activation (basically no activation) for continuous outputs, and sigmoid/softmax for classification.
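A sketch of what that change could look like in the question's forward_propagation (keeping everything else the same):

def forward_propagation(weights, biases, num_layers, act):
    acts = [[None] for i in range(num_layers)]
    z_values = [[None] for i in range(num_layers)]
    acts[0] = act
    for layer in range(1, num_layers):
        z_values[layer] = np.dot(weights[layer], acts[layer-1]) + biases[layer]
        # leaky ReLU on hidden layers, identity (no activation) on the output layer
        if layer < num_layers - 1:
            acts[layer] = relu(z_values[layer])
        else:
            acts[layer] = z_values[layer]
    return z_values, acts

In backward_propagation the output layer's local gradient then becomes 1, so the first zgrad line simplifies to zgrad = mse_grad(acts[-1], y) instead of multiplying by relu_grad(z_values[-1]).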
I am referring to this Variational Autoencoder code on GitHub for the CelebA dataset. There is an encoder and a decoder as usual. The encoder outputs x, which is used to get the log variance and mean of the posterior distribution. Now the reparameterization trick gives us the encoded
z = sigma * u + mean
where u comes from a normal distribution with mean = 0 and variance = 1. I can follow all of this in the code. However, in the ScaleShift class, which is a new Keras layer, I don't understand why we are using add_loss(logdet). I checked the Keras documentation for writing your own layers, and there doesn't seem to be a requirement to define a loss for a layer. I am confused by what this line's functionality is.
# new Keras layer
class ScaleShift(Layer):
    def __init__(self, **kwargs):
        super(ScaleShift, self).__init__(**kwargs)
    def call(self, inputs):
        z, shift, log_scale = inputs
        # reparameterization trick
        z = K.exp(log_scale) * z + shift
        # what are the next two lines doing?
        logdet = -K.sum(K.mean(log_scale, 0))
        self.add_loss(logdet)
        return z
#gets the mean parameter(z_shift) from x, and similarly the log variance parameter(z_log_scale)
z_shift = Dense(z_dim)(x)
z_log_scale = Dense(z_dim)(x)
#keras lambda layer, spits out random normal variable u of same dimension as z_shift
u = Lambda(lambda z: K.random_normal(shape=K.shape(z)))(z_shift)
z = ScaleShift()([u, z_shift, z_log_scale])
x_recon = decoder(z)
x_out = Subtract()([x_in, x_recon])
recon_loss = 0.5 * K.sum(K.mean(x_out**2, 0)) + 0.5 * np.log(2*np.pi) * np.prod(K.int_shape(x_out)[1:])
#KL divergence loss
z_loss = 0.5 * K.sum(K.mean(z**2, 0)) - 0.5 * K.sum(K.mean(u**2, 0))
vae_loss = recon_loss + z_loss
vae = Model(x_in, x_out)
#VAE loss to be optimised
vae.add_loss(vae_loss)
vae.compile(optimizer=Adam(1e-4))
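As an aside on self.add_loss (my note, not from the linked code): a loss tensor added inside a layer's call is collected by Keras and added to the compiled loss, which lets a penalty depend on intermediate tensors; built-in layers such as ActivityRegularization use the same mechanism. A minimal sketch with tf.keras (the layer name here is illustrative):

from tensorflow.keras import layers, backend as K

class L2Penalty(layers.Layer):
    # hypothetical layer: passes inputs through unchanged but registers an
    # extra loss term, the same mechanism ScaleShift uses for logdet
    def call(self, inputs):
        self.add_loss(1e-3 * K.sum(K.square(inputs)))
        return inputs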
I'm trying to code a neural network from scratch in Python. To check whether everything works I wanted to overfit the network, but the loss seems to explode at first and then come back to the initial value and stop there (it doesn't converge). I've checked my code but couldn't find the reason. I assume my understanding or implementation of backpropagation is incorrect, but there might be some other reason. Can anyone help me out or at least point me in the right direction?
# Initialize weights and biases given dimensions (for this example the dimensions are set to [12288, 64, 1])
def initialize_parameters(dims):
    # Initiate parameters
    parameters = {}
    L = len(dims)  # Number of layers in the network
    # Loop over the given dimensions. Initialize random weights and set biases to zero.
    for i in range(1, L):
        parameters["W" + str(i)] = np.random.randn(dims[i], dims[i-1]) * 0.01
        parameters["b" + str(i)] = np.zeros([dims[i], 1])
    return parameters
# Activation functions
def relu(x, deriv=False):
    if deriv:
        return 1. * (x > 0)
    else:
        return np.maximum(0, x)
def sigmoid(x, deriv=False):
    if deriv:
        # note: expects the already-activated value, i.e. x = sigmoid(z)
        return x * (1-x)
    else:
        return 1/(1 + np.exp(-x))
# Forward and backward pass for a 2-layer neural network (1st ReLU, 2nd sigmoid)
def forward_backward(X, Y, parameters):
    # Dictionary for storing gradients
    grads = {}
    # Number of examples
    m = Y.shape[1]
    # First layer
    Z1 = np.dot(parameters["W1"], X) + parameters["b1"]
    A1 = relu(Z1)
    # Second layer
    Z2 = np.dot(parameters["W2"], A1) + parameters["b2"]
    AL = sigmoid(Z2)
    # Compute cost (binary cross-entropy)
    cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
    # Backpropagation
    # Second layer
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dZ2 = dAL * sigmoid(AL, deriv=True)
    grads["dW2"] = np.dot(dZ2, A1.T) / m
    grads["db2"] = np.sum(dZ2, axis=1, keepdims=True) / m
    # First layer
    dA1 = np.dot(parameters["W2"].T, dZ2)
    dZ1 = dA1 * relu(A1, deriv=True)
    grads["dW1"] = np.dot(dZ1, X.T)
    grads["db1"] = np.sum(dZ1, axis=1, keepdims=True) / m
    return AL, grads, cost
# Hyperparameters
dims = [12288, 64, 1]
epoches = 2000
learning_rate = 0.1
# Initialize parameters
parameters = initialize_parameters(dims)
log_list = []
# Train the network
for i in range(epoches):
    # Get X and Y
    x = np.array(train[0:10], ndmin=2).T
    y = np.array(labels[0:10], ndmin=2).T
    # Perform forward and backward pass
    AL, grads, cost = forward_backward(x, y, parameters)
    # Append the cost to the log_list
    log_list.append(cost)
    # Update parameters with the computed gradients
    parameters = update_parameters(grads, parameters, learning_rate)
plt.plot(log_list)
plt.title("Loss of the network")
plt.show()
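update_parameters is called in the loop but not shown in the question; assuming it is a plain gradient-descent step, it would look something like this hypothetical reconstruction:

def update_parameters(grads, parameters, learning_rate):
    # hypothetical helper (not from the question): standard gradient descent
    for key in ("W1", "b1", "W2", "b2"):
        parameters[key] = parameters[key] - learning_rate * grads["d" + key]
    return parameters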
I am struggling to find the place where you calculate the error gradients; a sample of the input training data would also help...
I don't know if this will help you, but I'll share my solution for a Python neural network that learns the XOR problem.
import numpy as np
def sigmoid_function(x, derivative=False):
    """
    Sigmoid function
    "x" is the input and "y" the output; the nonlinear properties of this function mean that
    the rate of change is slower at the extremes and faster in the centre. Put plainly,
    we want the neuron to "make its mind up" instead of indecisively staying in the middle.
    :param x: Float
    :param derivative: Boolean
    :return: Float
    """
    if derivative:
        return x * (1 - x)  # derivative using the chain rule; expects x to already be sigmoid(z)
    else:
        return 1 / (1 + np.exp(-x))
# create dataset for XOR problem
input_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ideal_output = np.array([[0.0], [1.0], [1.0], [0.0]])
# initialize variables
learning_rate = 0.1
epoch = 50000  # number of iterations; one round of forward and back propagation is called an epoch
# get the second element of the shape to detect the number of input features
input_layer_neurons = input_data.shape[1]
hidden_layer_neurons = 3  # number of hidden-layer neurons
output_layer_neurons = 1  # number of output-layer neurons
# init weights & biases
weights_hidden = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bias_hidden = np.random.uniform(size=(1, hidden_layer_neurons))  # fixed: size= keyword, so the bias is a (1, n) row vector rather than a single scalar
weights_output = np.random.uniform(size=(hidden_layer_neurons, output_layer_neurons))
bias_output = np.random.uniform(size=(1, output_layer_neurons))  # fixed likewise
for i in range(epoch):
    # forward propagation
    hidden_layer_input_temp = np.dot(input_data, weights_hidden)  # matrix dot product to apply the layer weights
    hidden_layer_input = hidden_layer_input_temp + bias_hidden  # add the bias
    hidden_layer_activations = sigmoid_function(hidden_layer_input)  # apply the activation function
    output_layer_input_temp = np.dot(hidden_layer_activations, weights_output)
    output_layer_input = output_layer_input_temp + bias_output
    output = sigmoid_function(output_layer_input)  # final output
    # backpropagation (where the adjusting of the weights happens)
    error = ideal_output - output  # error at the output layer
    if i % 1000 == 0:
        print("Error: {}".format(np.mean(abs(error))))
    # use derivatives to compute the slope of the output and hidden layers
    slope_output_layer = sigmoid_function(output, derivative=True)
    slope_hidden_layer = sigmoid_function(hidden_layer_activations, derivative=True)
    # calculate deltas
    delta_output = error * slope_output_layer
    error_hidden_layer = delta_output.dot(weights_output.T)  # error propagated back to the hidden layer
    delta_hidden = error_hidden_layer * slope_hidden_layer
    # update the weights
    weights_output += hidden_layer_activations.T.dot(delta_output) * learning_rate
    bias_output += np.sum(delta_output, axis=0, keepdims=True) * learning_rate
    weights_hidden += input_data.T.dot(delta_hidden) * learning_rate
    bias_hidden += np.sum(delta_hidden, axis=0, keepdims=True) * learning_rate
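After training, the forward pass can be reused as a quick sanity check (my addition, using only names defined above):

hidden = sigmoid_function(np.dot(input_data, weights_hidden) + bias_hidden)
predictions = sigmoid_function(np.dot(hidden, weights_output) + bias_output)
print(np.round(predictions, 3))  # should approach [[0], [1], [1], [0]]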
I am trying to build an ANN in Python, and I've been able to get as far as the forward pass, but I get a problem when I try to do backward propagation. In my function nnCostFunction, the gradient grad is defined as:
grad = tr(c_[Theta1_grad.swapaxes(1,0).reshape(1,-1), Theta2_grad.swapaxes(1,0).reshape(1,-1)])
But this is a problem because I am using scipy.optimize.fmin_cg to calculate nn_params and cost, and fmin_cg accepts only a function that returns a single value (the J value from my forward pass); it cannot accept grad...
nn_params, cost = op.fmin_cg(lambda t: nnCostFunction(t, input_layer_size, hidden_layer_size, num_labels, X, y, lam), initial_nn_params, gtol = 0.001, maxiter = 40, full_output=1)[0, 1]
Is there a way to fix this so I can include backward propagation in my network? I know there is a scipy.optimize.minimize function, but I am having some difficulty understanding how to use it and get the results I need. Does anyone know what needs to be done?
Your help is greatly appreciated, thanks.
def nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lam):
    '''
    Given NN parameters, layer sizes, number of labels, data, and learning rate, returns the cost of traversing the NN.
    '''
    Theta1 = reshape(nn_params[:(hidden_layer_size*(input_layer_size+1))], (hidden_layer_size, (input_layer_size+1)))
    Theta2 = reshape(nn_params[(hidden_layer_size*(input_layer_size+1)):], (num_labels, (hidden_layer_size+1)))
    m = X.shape[0]
    n = X.shape[1]
    # forward pass
    y_eye = eye(num_labels)
    y_new = np.zeros((y.shape[0], num_labels))
    for z in range(y.shape[0]):
        y_new[z, :] = y_eye[int(y[z])-1]
    y = y_new
    a_1 = c_[ones((m, 1)), X]
    z_2 = tr(Theta1.dot(tr(a_1)))
    a_2 = tr(sigmoid(Theta1.dot(tr(a_1))))
    a_2 = c_[ones((a_2.shape[0], 1)), a_2]
    a_3 = tr(sigmoid(Theta2.dot(tr(a_2))))
    J_reg = lam/(2.*m) * (sum(sum(Theta1[:, 1:]**2)) + sum(sum(Theta2[:, 1:]**2)))
    J = (1./m) * sum(sum(-y*log(a_3) - (1-y)*log(1-a_3))) + J_reg
    # Backprop
    d_3 = a_3 - y
    d_2 = d_3.dot(Theta2[:, 1:])*sigmoidGradient(z_2)
    Theta1_grad = 1./m * tr(d_2).dot(a_1)
    Theta2_grad = 1./m * tr(d_3).dot(a_2)
    # Add regularization
    Theta1_grad[:, 1:] = Theta1_grad[:, 1:] + lam*1.0/m*Theta1[:, 1:]
    Theta2_grad[:, 1:] = Theta2_grad[:, 1:] + lam*1.0/m*Theta2[:, 1:]
    # Unroll gradients
    grad = tr(c_[Theta1_grad.swapaxes(1, 0).reshape(1, -1), Theta2_grad.swapaxes(1, 0).reshape(1, -1)])
    return J, grad
def nn_train(X, y, lam=1.0, hidden_layer_size=10):
    '''
    Train the neural network given the feature and class arrays, learning rate, and size of the hidden layer.
    Return parameters Theta1, Theta2.
    '''
    # NN input and output layer sizes
    input_layer_size = X.shape[1]
    num_labels = unique(y).shape[0]  # output layer
    # Initialize NN parameters
    initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)
    initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels)
    # Unroll parameters
    initial_nn_params = np.append(initial_Theta1.flatten(1), initial_Theta2.flatten(1))
    initial_nn_params = reshape(initial_nn_params, (len(initial_nn_params),))  # flatten into a 1-d array
    # Find and print the initial cost:
    J_init = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lam)[0]
    grad_init = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lam)[1]
    print('Initial J cost: ' + str(J_init))
    print('Initial grad cost: ' + str(grad_init))
    # Implement backprop and train the network, run fmin
    print('Training Neural Network...')
    print('fmin results:')
    nn_params, cost = op.fmin_cg(lambda t: nnCostFunction(t, input_layer_size, hidden_layer_size, num_labels, X, y, lam), initial_nn_params, gtol=0.001, maxiter=40, full_output=1)[0, 1]
    Theta1 = reshape(nn_params[:(hidden_layer_size*(input_layer_size+1))], (hidden_layer_size, (input_layer_size+1)))
    Theta2 = reshape(nn_params[(hidden_layer_size*(input_layer_size+1)):], (num_labels, (hidden_layer_size+1)))
    return Theta1, Theta2
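A sketch of the usual fix, assuming the nnCostFunction above: fmin_cg takes the objective and its gradient as two separate callables, f and fprime, so a function returning (J, grad) can be split into two thin wrappers:

def cost_only(t):
    return nnCostFunction(t, input_layer_size, hidden_layer_size, num_labels, X, y, lam)[0]

def grad_only(t):
    # fprime must return a 1-d array, hence the ravel()
    return nnCostFunction(t, input_layer_size, hidden_layer_size, num_labels, X, y, lam)[1].ravel()

nn_params = op.fmin_cg(cost_only, initial_nn_params, fprime=grad_only, gtol=0.001, maxiter=40)

Alternatively, scipy.optimize.minimize accepts jac=True, in which case the objective may return the (cost, gradient) pair directly.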
I am trying to define my own RNNCell (an Echo State Network) in TensorFlow, according to the definition below.
x(t + 1) = tanh(Win*u(t) + W*x(t) + Wfb*y(t))
y(t) = Wout*z(t)
z(t) = [x(t), u(t)]
x is the state, u is the input, y is the output. Win, W, and Wfb are not trainable. All weights are randomly initialized, but W is modified like this: "Set a certain percentage of the elements of W to 0, scale W to keep its spectral radius below 1.0".
I have this code to implement the equations.
x = tf.Variable(tf.reshape(tf.zeros([N]), [-1, N]), trainable=False, name="state_vector")
W = tf.Variable(tf.random_normal([N, N], 0.0, 0.05), trainable=False)
# TODO: setup W according to the ESN paper
W_x = tf.matmul(x, W)
u = tf.placeholder("float", [None, K], name="input_vector")
W_in = tf.Variable(tf.random_normal([K, N], 0.0, 0.05), trainable=False)
W_in_u = tf.matmul(u, W_in)
z = tf.concat(1, [x, u])
W_out = tf.Variable(tf.random_normal([K + N, L], 0.0, 0.05))
y = tf.matmul(z, W_out)
W_fb = tf.Variable(tf.random_normal([L, N], 0.0, 0.05), trainable=False)
W_fb_y = tf.matmul(y, W_fb)
x_next = tf.tanh(W_in_u + W_x + W_fb_y)
y_ = tf.placeholder("float", [None, L], name="train_output")
My problem is two-fold. First, I don't know how to implement this as a subclass of RNNCell. Second, I don't know how to generate a W tensor according to the above specification.
Any help with either of these questions is greatly appreciated. Maybe I can figure out a way to prepare W, but I sure as hell don't understand how to implement my own RNN as a subclass of RNNCell.
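For the W part, a minimal NumPy sketch of one way to follow that recipe (the helper name and parameter values are my own assumptions, not from the paper):

import numpy as np

def make_reservoir(N, sparsity=0.9, target_radius=0.95):
    W = np.random.normal(0.0, 0.05, size=(N, N))
    W[np.random.rand(N, N) < sparsity] = 0.0       # set a percentage of the elements to 0
    radius = np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius = largest |eigenvalue|
    if radius > 0:
        W *= target_radius / radius                # scale so the spectral radius stays below 1.0
    return W

The result could then be used as the initial value of the non-trainable W variable.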
To give a quick summary:
Look in the TensorFlow source code under python/ops/rnn_cell.py to see how to subclass RNNCell. It's usually like this:
class MyRNNCell(RNNCell):
    def __init__(...):
        ...

    @property
    def output_size(self):
        ...

    @property
    def state_size(self):
        ...

    def __call__(self, input_, state, name=None):
        ...  # your per-step iteration here
I may arrive a bit late, but for anybody trying to do the same: tensorflow-addons added an implementation of an ESNCell and an ESN layer.
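A hedged usage sketch (assuming tensorflow-addons is installed; the exact signature should be checked against the tfa docs):

import tensorflow as tf
import tensorflow_addons as tfa

cell = tfa.rnn.ESNCell(units=100, connectivity=0.1, spectral_radius=0.9)
esn = tf.keras.layers.RNN(cell)
states = esn(tf.random.uniform([8, 20, 3]))  # (batch, time, features) -> (batch, units)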