I am implementing a simple neural network classifier for the iris dataset. The NN has 3 input nodes, 1 hidden layer with two nodes, and 3 output nodes. I have implemented everything, but the values of the partial derivatives are not calculated correctly. I have exhausted myself looking for the solution but couldn't find it.
Here is my code for calculating the partial derivatives.
def derivative_cost_function(self, X, Y, thetas):
    '''
    Computes the derivatives of the cost function w.r.t. the input parameters (thetas)
    for the given inputs and labels.

    Input:
    ------
    X: either a single d x n-dimensional vector or a d x n-dimensional matrix of inputs
    thetas: a dk x 1-dimensional vector representing the parameters of the k classes
    Y: a k x n-dimensional label matrix

    Returns:
    ------
    partial_thetas: a dk x 1-dimensional vector of partial derivatives of the cost function w.r.t. the parameters.
    '''
    # forward pass
    a2, a3 = self.forward_pass(X, thetas)

    # now back-propagate
    # unroll thetas
    l1theta, l2theta = self.unroll_thetas(thetas)

    nexamples = float(X.shape[1])

    # compute delta3 and the layer-2 derivatives
    a3 = np.array(a3)
    a2 = np.array(a2)
    Y = np.array(Y)
    a3 = a3.T
    # note: the a3*(1-a3) factors cancel, so this simplifies to delta3 = a3 - Y
    delta3 = (a3 * (1 - a3)) * ((a3 - Y) / (a3 * (1 - a3)))
    l2Derivatives = np.dot(delta3, a2)
    # print("Layer 2 derivatives shape = ", l2Derivatives.shape)
    # print("Layer 2 derivatives = ", l2Derivatives)

    # compute delta2 and the layer-1 derivatives
    a2 = a2.T
    dotProduct = np.dot(l2theta.T, delta3)
    delta2 = dotProduct * a2 * (1 - a2)
    # remember to exclude the bias-term deltas, i.e. use delta2[1:]
    l1Derivatives = np.dot(delta2[1:], X.T)
    # print("Layer 1 derivatives shape = ", l1Derivatives.shape)
    # print("Layer 1 derivatives = ", l1Derivatives)

    # roll the derivatives back into one big vector
    thetas = (self.roll_thetas(l1Derivatives, l2Derivatives)).reshape(thetas.shape)  # return the same shape as received
    return thetas
Why not have a look at my implementation: https://github.com/zizhaozhang/simple_neutral_network/blob/master/nn.py
The derivative computation is here:
def dCostFunction(self, theta, in_dim, hidden_dim, num_labels, X, y):
    # compute gradient
    t1, t2 = self.uncat(theta, in_dim, hidden_dim)

    a1, z2, a2, z3, a3 = self._forward(X, t1, t2)  # p x s matrix

    # t1 = t1[1:, :]  # remove bias term
    # t2 = t2[1:, :]

    sigma3 = -(y - a3) * self.dactivation(z3)  # should dsigmoid be applied here?
    sigma2 = np.dot(t2, sigma3)

    # prepend a row of ones so the bias row is carried through
    term = np.ones((1, num_labels))
    sigma2 = sigma2 * np.concatenate((term, self.dactivation(z2)), axis=0)

    theta2_grad = np.dot(sigma3, a2.T)
    theta1_grad = np.dot(sigma2[1:, :], a1.T)

    theta1_grad = theta1_grad / num_labels
    theta2_grad = theta2_grad / num_labels

    return self.cat(theta1_grad.T, theta2_grad.T)
Hope it helps
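Whenever backprop derivatives look wrong, comparing the analytic gradient against a numerical one usually pinpoints the bug. Here is a minimal sketch of such a check; it assumes a cost_function(X, Y, thetas) method exists alongside derivative_cost_function, which is an assumption on my part:

import numpy as np

def numerical_gradient(cost_fn, X, Y, thetas, eps=1e-4):
    # Central-difference estimate of d(cost)/d(thetas), one parameter at a time.
    grad = np.zeros_like(thetas, dtype=float)
    for i in range(thetas.size):
        t_plus, t_minus = thetas.copy(), thetas.copy()
        t_plus.flat[i] += eps
        t_minus.flat[i] -= eps
        grad.flat[i] = (cost_fn(X, Y, t_plus) - cost_fn(X, Y, t_minus)) / (2 * eps)
    return grad

# analytic = nn.derivative_cost_function(X, Y, thetas)
# numeric = numerical_gradient(nn.cost_function, X, Y, thetas)  # hypothetical method name
# print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-8) if the derivatives are right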
import numpy as np
import pandas as pd
from matplotlib import pyplot as pt

def computeCost(X, y, theta):
    m = len(y)
    predictions = X * theta - y
    sqrerror = np.power(predictions, 2)
    return 1 / (2 * m) * np.sum(sqrerror)

def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    jhistory = np.zeros((num_iters, 1))
    for i in range(num_iters):
        h = X * theta
        s = h - y
        theta = theta - (alpha / m) * (s.T * X).T
        jhistory_iter = computeCost(X, y, theta)
    return theta, jhistory_iter

data1 = np.array(pd.read_csv(r'C:\Users\Coding\Desktop\machine-learning-ex1\ex1\ex1data1.txt', header=None))
y = np.array(data1[:, 1])
m = len(y)
y = np.asmatrix(y.reshape(m, 1))
X = np.array([data1[:, 0]]).reshape(m, 1)
X = np.asmatrix(np.insert(X, 0, 1, axis=1))
theta = np.zeros((2, 1))
iterations = 1500
alpha = 0.01

print('Testing the cost function ...')
J = computeCost(X, y, theta)
print('With theta = [0 , 0]\nCost computed = ', J)
print('Expected cost value (approx) 32.07')

theta = np.asmatrix([[-1, 0], [1, 2]])
J = computeCost(X, y, theta)
print('With theta = [-1 , 2]\nCost computed =', J)
print('Expected cost value (approx) 54.24')

theta, JJ = gradientDescent(X, y, theta, alpha, iterations)
print('Theta found by gradient descent:')
print(theta)
print('Expected theta values (approx)')
print(' -3.6303\n 1.1664\n')

predict1 = [1, 3.5] * theta
print(predict1 * 10000)
Result:
Testing the cost function ...
With theta = [0 , 0]
Cost computed = 32.072733877455676
Expected cost value (approx) 32.07
With theta = [-1 , 2]
Cost computed = 69.84811062494227
Expected cost value (approx) 54.24
Theta found by gradient descent:
[[-3.70304726 -3.64357517]
[ 1.17367146 1.16769684]]
Expected theta values (approx)
-3.6303
1.1664
[[4048.02858742 4433.63790186]]
There are two problems: the first computed cost is right, but the second one is wrong. Also, my gradient descent returns 4 elements in theta (it is supposed to be two).
When you mention "With theta = [-1 , 2]" and you enter

theta = np.asmatrix([[-1, 0], [1, 2]])

I think this is incorrect. Assuming that you have a single feature, you added a column of ones, and you are trying to do simple linear regression, the correct way should be

theta = np.array([[-1], [2]])

Also, where you have

predictions = X * theta - y

it would be better to write

predictions = np.dot(X, theta) - y

When you multiply plain arrays with *, it is element-wise, so it's not doing the same thing.
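For instance, here is a minimal sketch of the corrected check, assuming X is the m x 2 design matrix with the bias column and y is m x 1, both as plain arrays:

import numpy as np

def computeCost(X, y, theta):
    m = len(y)
    predictions = np.dot(X, theta) - y  # np.dot behaves the same for arrays and matrices
    return 1 / (2 * m) * np.sum(np.square(predictions))

theta = np.array([[-1], [2]])   # shape (2, 1): one intercept, one slope
# J = computeCost(X, y, theta)  # should now print roughly 54.24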
I have created a neural network with one hidden layer with 3 nodes; the sigmoid function was used as the activation function.
The input layer (2 nodes):
X | X is a matrix with size (1x2)
Hidden layer (3 nodes):
W1 => will have a size (3x2)
Z1 = X.(W1.T) + b | (W1.T) is the transpose of matrix W1
A1 = σ(Z1)
Output layer (4 nodes):
W2 => will have a size (4x3)
Z2 = A1.(W2.T) + b
A2 = σ(Z2)
E = 0.5*(Y - A2)^2 | Y is a matrix with size (1x4)
I wanted to calculate how the W of the hidden layer affects E (the cost function), applying the chain rule:
dEdW1 = dEdA2 * dA2dZ2 * dZ2dA1 * dA1dZ1 * dZ1dW1
= (A2-Y) * (σ(Z2)*(1-σ(Z2))) * W2.T * (σ(Z1)*(1-σ(Z1))) * X
matrix sizes=> (1x4) * (1x4) * (3x4) * (1x3) * (1x2)
I know there is something wrong with my derivatives, as I cannot multiply the matrices (they come in different sizes). Can someone point out how to fix this?
Here is the implementation:
import numpy

class Sigmoid:
    def sigmoid(self, Z):
        return (1 / (1 + numpy.exp(-Z)))

    # change in sigmoid w.r.t Z
    def dsigmoiddZ(self, Z):
        return self.sigmoid(Z) * (1 - self.sigmoid(Z))

class PrimaryLayer:
    def __init__(self, node_count):
        self.node_count = node_count
        self.previous_layer = None
        self.next_layer = None
        self.Z = None
        self.A = None

class SecondaryLayer(PrimaryLayer):
    def __init__(self, node_count, previous_layer):
        super().__init__(node_count)
        self.dEdW = None
        self.set_previous_layer(previous_layer)
        self.set_next_layer_to_self()
        self.set_W()
        self.set_b()
        self.set_Z()
        self.set_A()

    def set_previous_layer(self, previous_layer):
        self.previous_layer = previous_layer

    # This points the next_layer of previous_layer to self
    def set_next_layer_to_self(self):
        self.previous_layer.next_layer = self

    # Randomly generate weights for this layer
    def set_W(self):
        self.W = numpy.random.random((self.node_count, self.previous_layer.node_count))

    def set_b(self):
        self.b = numpy.random.random()

    def set_W_adjusted(self, learning_rate):
        self.W_adjusted = W_adjusted

    def set_Z(self):
        previous_layer = self.previous_layer
        if previous_layer.previous_layer is not None:
            self.Z = numpy.dot(previous_layer.A, self.W.T) + self.b
        else:
            # if previous_layer.previous_layer is None then it is the input layer
            self.Z = numpy.dot(previous_layer.X, self.W.T) + self.b

    def set_A(self):
        self.A = Sigmoid().sigmoid(self.Z)

    def set_dAdZ(self):
        self.dAdZ = self.A * (1 - self.A)

    def set_dZdW(self):
        self.dZdW = self.A

class InputLayer(PrimaryLayer):
    def __init__(self, node_count, X):
        super().__init__(node_count)
        self.X = X

class OutputLayer(SecondaryLayer):
    def __init__(self, node_count, previous_layer, Y):
        super().__init__(node_count, previous_layer)
        self.Y = Y
        self.set_E()

    def set_E(self):
        self.E = ((1 / 2) * numpy.square(self.Y - self.A))

    def set_dEdA(self):
        self.dEdA = self.A - self.Y

    def set_dEdW(self):
        self.set_dEdA()
        self.set_dAdZ()
        self.set_dZdW()
        self.dEdW = self.dEdA * self.dAdZ * self.dZdW

class HiddenLayer(SecondaryLayer):
    def __init__(self, node_count, previous_layer):
        super().__init__(node_count, previous_layer)

    def set_dEdW(self):
        self.set_dAdZ()
        self.set_dZdW()
        self.dEdW = self.dAdZ * self.dZdW * self.calculate_derivative(self.next_layer)

    # calculate derivatives for the other layers
    def calculate_derivative(self, next_layer):
        this_layer = next_layer
        if this_layer.next_layer is None:
            # the output layer
            this_layer.set_dAdZ()
            this_layer.set_dEdA()
            dAdZ = this_layer.dAdZ
            dEdA = this_layer.dEdA
            return dAdZ * dEdA
        elif this_layer is not None:
            this_layer.set_dAdZ()
            # How a layer's Z changes w.r.t the previous layer's A
            dZdA = this_layer.W.T
            # How the current layer's A changes w.r.t the current layer's Z
            dAdZ = this_layer.dAdZ
            return dZdA * dAdZ * self.calculate_derivative(this_layer.next_layer)

class NeuralNetwork:
    def __init__(self):
        X = numpy.array([[1, 1]])
        Y = numpy.array([[2, 2, 2, 2]])
        # create layers and forward propagate
        self.input_layer = InputLayer(2, X)
        self.h1 = HiddenLayer(3, self.input_layer)
        self.output_layer = OutputLayer(4, self.h1, Y)
        self.forward_propagation(self.input_layer)
        self.back_propagation(self.output_layer)

    def forward_propagation(self, input_layer):
        current_layer = input_layer.next_layer
        if current_layer is not None:
            current_layer.set_Z()
            current_layer.set_A()
            # if the current layer is the output layer
            if current_layer.next_layer is None:
                current_layer.set_E()
            self.forward_propagation(current_layer)

    def back_propagation(self, output_layer):
        current_layer = output_layer
        # loop through all non-input layers
        if current_layer.previous_layer is not None:
            current_layer.set_dEdW()
            print(current_layer.dEdW)
            self.back_propagation(current_layer.previous_layer)

nn = NeuralNetwork()
You should probably transpose your matrices.
Matrix multiplication works like "AxB * BxC = AxC", so you will not be able to multiply (1x4) * (3x4). Try transposing the second one, so you'll have a (1x4)*(4x3) multiplication.
Also reorder the rest, so it looks like (4x1)(1x4)(4x3)(3x1)(1x2).
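To make this concrete, here is a minimal numpy sketch with the question's sizes (my own variable names, biases omitted for brevity) in which every product lines up after the transposes:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1.0, 1.0]])            # 1x2
Y = np.array([[2.0, 2.0, 2.0, 2.0]])  # 1x4
W1 = np.random.random((3, 2))         # hidden layer weights
W2 = np.random.random((4, 3))         # output layer weights

A1 = sigmoid(X @ W1.T)                # 1x3
A2 = sigmoid(A1 @ W2.T)               # 1x4

delta2 = (A2 - Y) * A2 * (1 - A2)       # 1x4, element-wise products
dEdW2 = delta2.T @ A1                   # 4x1 @ 1x3 = 4x3, matches W2
delta1 = (delta2 @ W2) * A1 * (1 - A1)  # 1x4 @ 4x3 = 1x3, element-wise with 1x3
dEdW1 = delta1.T @ X                    # 3x1 @ 1x2 = 3x2, matches W1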
Switch your weight matrices. Suppose the following: given an input matrix X of dimensions 4x2 (that is, you have 4 inputs - your rows - and each input has a dimensionality of 2 elements), for example:
[[0, 0], [0, 1], [1, 0], [1, 1]]
If your hidden layer has 4 neurons, your w0 weight matrix turns out to be of size 2x4. You'll have a b0 bias vector of 4 elements and upon calculating the activation on this layer, you'll end up with:
z1 = X @ w0 + b0  # 4x2 @ 2x4 + 1x4 = 4x4
a1 = f(z1)        # 4x4
Then let's say in the output layer you have 2 elements which makes your w1 matrix of size 4x2 and your b1 vector of size 2. Doing the maths gives:
z2 = a1 @ w1 + b1  # 4x4 @ 4x2 + 1x2 = 4x2
a2 = f(z2)         # 4x2
During backpropagation you calculate a delta for each layer as:
d_O = (Y - T) * f'(z2)
d_H = d_O @ W1.T * f'(z1)
where Y is your network's guess of size: 4x2
T are the training labels, so they must match up: 4x2
The multiplication is an element-wise multiplication, which makes your
d_O of size 4x2
Calculating the hidden layer delta dimension is pretty similar:
d_O: 4x2
W1 is of size 4x2, so the transpose is 2x4; with @ being the standard matrix multiplication, your d_H dimension turns out as:
d_H of size 4x4
For updating your weights then you have:
dW1 := a1.T @ d_O
dW0 := X.T @ d_H
where a1 is a 4x4 matrix, d_O is a 4x2 matrix, which makes your dW1 a 4x2 matrix. W1 was of size 4x2.
X is your input matrix of size 4x2, X.T is then 2x4, and d_H was 4x4, which makes dW0 a 2x4. Your original W0 was 2x4.
Bias deltas are a lot simpler:
db1 = np.sum(d_O, axis=0)
db0 = np.sum(d_H, axis=0)
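Putting the shapes above together, a self-contained sketch (my own variable names; f is the sigmoid, and the XOR-style data is just an illustration) could look like this:

import numpy as np

def f(z):   # sigmoid activation
    return 1 / (1 + np.exp(-z))

def df(z):  # its derivative f'(z)
    s = f(z)
    return s * (1 - s)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # 4x2 inputs
T = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)  # 4x2 labels

w0 = np.random.randn(2, 4); b0 = np.zeros(4)  # input -> hidden
w1 = np.random.randn(4, 2); b1 = np.zeros(2)  # hidden -> output

# forward pass
z1 = X @ w0 + b0           # 4x4
a1 = f(z1)
z2 = a1 @ w1 + b1          # 4x2
Y = f(z2)                  # the network's guess

# backpropagation
d_O = (Y - T) * df(z2)     # 4x2
d_H = d_O @ w1.T * df(z1)  # 4x4

dW1 = a1.T @ d_O           # 4x2, same shape as w1
dW0 = X.T @ d_H            # 2x4, same shape as w0
db1 = np.sum(d_O, axis=0)
db0 = np.sum(d_H, axis=0)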
I'm trying to implement regularized logistic regression in Python for the Coursera ML class, but I'm having a lot of trouble vectorizing it. Using this repository:
I've tried many different ways but never get the correct gradient or cost. Here's my current implementation:
h = utils.sigmoid( np.dot(X, theta) )
J = (-1/m) * ( y.T.dot( np.log(h) ) + (1 - y.T).dot( np.log( 1 - h ) ) ) + ( lambda_/(2*m) ) * np.sum( np.square(theta[1:]) )
grad = ((1/m) * (h - y).T.dot( X )).T + grad_theta_reg
Here are the results:
Cost: 0.693147
Expected cost: 2.534819
Gradients:
[-0.100000, -0.030000, -0.080000, -0.130000]
Expected gradients:
[0.146561, -0.548558, 0.724722, 1.398003]
Any help from someone who knows what's going on would be much appreciated.
Below is a working snippet of a vectorized version of logistic regression. You can see more here: https://github.com/hzitoun/coursera_machine_learning_matlab_python
Main
theta_t = np.array([[-2], [-1], [1], [2]])
data = np.arange(1, 16).reshape(3, 5).T
X_t = np.c_[np.ones((5, 1)), data / 10]
y_t = (np.array([[1], [0], [1], [0], [1]]) >= 0.5) * 1
lambda_t = 3
J, grad = lrCostFunction(theta_t, X_t, y_t, lambda_t), lrGradient(theta_t, X_t, y_t, lambda_t, flattenResult=False)
print('\nCost:', J)
print('Expected cost: 2.534819\n')
print('Gradients:')
print(grad)
print('Expected gradients:')
print(' 0.146561\n -0.548558\n 0.724722\n 1.398003\n')
lrCostFunction
from sigmoid import sigmoid
import numpy as np

def lrCostFunction(theta, X, y, reg_lambda):
    """LRCOSTFUNCTION Computes the cost for logistic regression with
    regularization.

    J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
    theta as the parameter for regularized logistic regression.
    """
    m, n = X.shape  # number of training examples and features
    theta = theta.reshape((n, 1))
    prediction = sigmoid(X.dot(theta))
    cost_y_1 = (1 - y) * np.log(1 - prediction)
    cost_y_0 = -1 * y * np.log(prediction)
    J = (1.0 / m) * np.sum(cost_y_0 - cost_y_1) + (reg_lambda / (2.0 * m)) * np.sum(np.power(theta[1:], 2))
    return J
lrGradient
from sigmoid import sigmoid
import numpy as np

def lrGradient(theta, X, y, reg_lambda, flattenResult=True):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    prediction = sigmoid(np.dot(X, theta))
    errors = np.subtract(prediction, y)
    grad = (1.0 / m) * np.dot(X.T, errors)
    # regularize every parameter except the intercept (theta[0])
    grad_with_regul = grad[1:] + (reg_lambda / m) * theta[1:]
    firstRow = grad[0, :].reshape((1, 1))
    grad = np.r_[firstRow, grad_with_regul]
    if flattenResult:
        return grad.flatten()
    return grad
Hope that helped!
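Both snippets import a local sigmoid module that isn't shown in the answer; a minimal version (my own sketch) would be:

# sigmoid.py -- helper module assumed by the snippets above
import numpy as np

def sigmoid(z):
    # element-wise logistic function 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))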
I'm implementing the analytical form of this function
where k(x,y) is an RBF kernel: k(x,y) = exp(-||x-y||^2 / (2h))
My function prototype is
def A(X, Y, grad_log_px, Kxy):
    pass
where X and Y are NxD matrices, N being the batch size and D the dimension. So X is a batch of x with size N in the above equation. grad_log_px is an NxD matrix I've computed using autograd.
Kxy is an NxN matrix where each entry (i, j) is the RBF kernel k(X[i], Y[j]).
The challenge here is that in the above equation, y is just a vector with dimension D. I want to pass in a batch of y (i.e. to pass a matrix Y of size NxD).
The equation is straightforward to compute by looping through the batch, but I'm having trouble implementing it in a neater way.
Here is my attempted loop solution:
def A(X, Y, grad_log_px, Kxy):
    # D and h are assumed to be defined globally
    res = []
    for i in range(Y.shape[0]):
        temp = 0
        for j in range(X.shape[0]):
            # first term of the equation
            temp += grad_log_px[j].reshape(D, 1) @ (Kxy[j, i] * (X[i] - Y[j]) / h).reshape(1, D)
            # second term of the equation
            temp += Kxy[j, i] * np.identity(D) - ((X[i] - Y[j]) / h).reshape(D, 1) @ (Kxy[j, i] * (X[i] - Y[j]) / h).reshape(1, D)
        temp /= X.shape[0]
        res.append(temp)
    return np.asarray(res)  # return an NxDxD array
In the equation, grad_x and grad_y both have dimension D.
Given that I inferred all the dimensions of the various terms correctly, here's a way to go about it. First a summary of the dimensions (please verify they are correct): X is N_x x D, Y is N_y x D, grad_log_px is N_x x D, and K is N_x x N_y.
Also note the double derivative of the second term, which for this RBF kernel gives:
grad_x^l grad_y^m k(x_i, y_j) = (δ_lm / h - (x_i^l - y_j^l)(x_i^m - y_j^m) / h^2) * k(x_i, y_j)
where subscripts denote samples and superscripts denote features.
So we can create the two terms by using np.einsum (similarly torch.einsum) and array broadcasting:
grad_y_K = (X[:, None, :] - Y) / h * K[:, :, None] # Shape: N_x, N_y, D
term_1 = np.einsum('ij,ikl->ikjl', grad_log_px, grad_y_K) # Shape: N_x, N_y, D_x, D_y
term_2_h = np.einsum('ij,kl->ijkl', K, np.eye(D)) / h # Shape: N_x, N_y, D_x, D_y
term_2_h2_xy = np.einsum('ijk,ijl->ijkl', grad_y_K, grad_y_K) # Shape: N_x, N_y, D_x, D_y
term_2_h2 = K[:, :, None, None] * term_2_h2_xy / h**2 # Shape: N_x, N_y, D_x, D_y
term_2 = term_2_h - term_2_h2 # Shape: N_x, N_y, D_x, D_y
Then the result is given by:
(term_1 + term_2).sum(axis=0) / N # Shape: N_y, D_x, D_y
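As a quick sanity check of the einsum indices, here is a small driver with hypothetical sizes (my own values, not from the question):

import numpy as np

N_x, N_y, D, h = 4, 5, 3, 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(N_x, D))
Y = rng.normal(size=(N_y, D))
grad_log_px = rng.normal(size=(N_x, D))

# RBF kernel matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2h))
K = np.exp(-((X[:, None, :] - Y) ** 2).sum(axis=-1) / (2 * h))

grad_y_K = (X[:, None, :] - Y) / h * K[:, :, None]
term_1 = np.einsum('ij,ikl->ikjl', grad_log_px, grad_y_K)
term_2_h = np.einsum('ij,kl->ijkl', K, np.eye(D)) / h
term_2_h2 = K[:, :, None, None] * np.einsum('ijk,ijl->ijkl', grad_y_K, grad_y_K) / h**2
result = (term_1 + term_2_h - term_2_h2).sum(axis=0) / N_x
print(result.shape)  # (5, 3, 3), i.e. (N_y, D, D)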
import numpy as np
from numpy import exp, random, log, size, amax
from numpy import multiply as mul
from numpy import vstack
from matplotlib import pyplot as plot

def sigmoid(x):
    return 1.0 / (1 + exp(-x))

def backPro(X, y, alpha):
    a1 = X.T
    n1 = size(X, 1)
    n2 = 10
    m = size(a1, 1)
    a1 = vstack((np.ones((1, m)), a1))
    K = size(y, 0)
    Jlist = []
    s = 100
    lit = 0

    # initialization (randomize)
    theta1 = random.rand(n2, n1 + 1)
    theta2 = random.rand(K, n2 + 1)

    while lit < 10000:
        z2 = theta1 * a1
        a2 = sigmoid(z2)
        a2 = vstack((np.ones((1, m)), a2))
        z3 = theta2 * a2
        H = sigmoid(z3)
        J = (-1.0 / m) * np.sum(mul(y, log(H)) + mul(1 - y, log(1 - H)))
        Jlist.append(J)

        sigma3 = H - y
        sigma2 = mul((theta2.T * sigma3), mul(a2, (1 - a2)))
        delta1 = sigma2[1:] * a1.T * (1.0 / m)
        delta2 = sigma3 * a2.T * (1.0 / m)

        # do the gradient descent
        theta1 -= (delta1 * alpha)
        theta2 -= (delta2 * alpha)

        # update the s
        # s = max(amax(np.abs(delta1)), amax(np.abs(delta2)))
        lit += 1

    plot.scatter(range(0, len(Jlist)), Jlist, 1)
    plot.scatter(range(0, len(Jlist[-100:])), Jlist[-100:], 1)
    print("The J is " + str(J))
    print("The S is " + str(s))
    print("lit is " + str(lit))
    return theta1, theta2

def cost(a1, y, h):
    m = size(a1, 0)
    Kmatrix = np.sum(mul(y, log(h)) + mul((1 - y), log(1 - h)))
    J = (-1.0 / m) * Kmatrix
    return J

def predict(X, theta1, theta2):
    X = X.T
    m = size(X, 1)
    X = vstack((np.ones((1, m)), X))
    hidden = sigmoid(theta1 * X)
    hidden = vstack((np.ones((1, m)), hidden))
    print(sigmoid(theta2 * hidden))
    return np.argmax(sigmoid(theta2 * hidden), 0).T + 1
I want to use this multiclass classifier to classify handwritten digits that have already been converted to numbers (the original images are simply binary pixels, where 1 represents a pure black pixel and 0 a pure white one).
In practice, X stores the data as a numpy.matrix with dimensions n1 x m (rows x columns), where n1 is the number of input-layer nodes and m is the number of examples; y stores the labels (values from 1 to 9), also as a numpy.matrix, with dimensions K x m (K is the number of labels, which is 9 here, and m is again the number of examples). By the way, I have cancelled the calculation of s in backPro.
When I put X, y, and alpha (the learning rate) into backPro and get theta1 and theta2, I then use these two parameters to calculate the output on the training set (yes, just the training set), but I find that the predictions for all the examples are the same, and here I mean exactly the same. I find it incredible.
And here is the result:
from simpleneu import predict as pre
pre(X,theta1,theta2)
[[0.10106717 0.10106717 0.10106717 ... 0.10106717 0.10106717 0.10106717]
[0.10169492 0.10169492 0.10169492 ... 0.10169492 0.10169492 0.10169492]
[0.09981168 0.09981168 0.09981168 ... 0.09981168 0.09981168 0.09981168]
...
[0.09918393 0.09918393 0.09918393 ... 0.09918393 0.09918393 0.09918393]
[0.09730069 0.09730069 0.09730069 ... 0.09730069 0.09730069 0.09730069]
[0.09918393 0.09918393 0.09918393 ... 0.09918393 0.09918393 0.09918393]]
Out[99]:
matrix([[2],
[2],
[2],
...,
[2],
[2],
[2]])
The dimension of the output is K x m, and in the predict() function I print out sigmoid(theta2 * hidden) before converting the outputs to labels by picking the class with the largest probability.
And here is the graph of the cost J with respect to the number of iterations: [cost-vs-iterations plot not shown]