Debugging a Neural Network - python
TLDR
I have been trying to fit a simple neural network to MNIST. It works on a small debugging setup, but when I move it over to a subset of MNIST it trains very fast and the gradient approaches 0 almost immediately, after which it outputs the same value for every input and the final cost is quite high. I had been deliberately trying to overfit to make sure the implementation works, but it will not do so on MNIST, which suggests a deep problem in the setup. I have checked my backpropagation implementation using gradient checking and it seems to match up, so I am not sure where the error lies or what to work on next!
Many thanks for any help you can offer; I've been struggling to fix this!
Explanation
I have been trying to make a neural network in NumPy, based on these explanations:
http://ufldl.stanford.edu/wiki/index.php/Neural_Networks
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
Backpropagation seems to match gradient checking:
Backpropagation: [ 0.01168585, 0.06629858, -0.00112408, -0.00642625, -0.01339408,
-0.07580145, 0.00285868, 0.01628148, 0.00365659, 0.0208475 ,
0.11194151, 0.16696139, 0.10999967, 0.13873069, 0.13049299,
-0.09012582, -0.1344335 , -0.08857648, -0.11168955, -0.10506167]
Gradient Checking: [-0.01168585 -0.06629858 0.00112408 0.00642625 0.01339408
0.07580145 -0.00285868 -0.01628148 -0.00365659 -0.0208475
-0.11194151 -0.16696139 -0.10999967 -0.13873069 -0.13049299
0.09012582 0.1344335 0.08857648 0.11168955 0.10506167]
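For anyone comparing the two vectors above, one common way to quantify the agreement is the element-wise relative error rather than eyeballing the numbers. A minimal sketch, assuming grad_bp and grad_num hold the unrolled backpropagation and numerical gradient vectors (these names are placeholders, not from the code below):
import numpy as np

def relative_error(grad_bp, grad_num, eps=1e-12):
    # Element-wise relative difference; values around 1e-7 or smaller
    # usually indicate that the two gradients genuinely agree.
    return np.abs(grad_bp - grad_num) / (np.abs(grad_bp) + np.abs(grad_num) + eps)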
And when I train on this simple debug setup:
a is a neural net with 2 inputs -> 5 hidden -> 2 outputs and a learning rate of 0.5
a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))
ie. x1 = [0.1, 0.9] and y1 = [0,1]
I get these lovely training curves
Admittedly this is clearly a dumbed down, very easy function to fit.
However as soon as I bring it over to MNIST, with this setup:
# Number of input, hidden and output nodes
# Input = 28 x 28 pixels
input_nodes=784
# Arbitrary number of hidden nodes, experiment to improve
hidden_nodes=200
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
output_nodes=10
# Learning rate
learning_rate=0.4
# Regularisation parameter
lambd=0.0
With this setup, running the code below for 100 iterations, it does seem to train at first but then "flat-lines" quite quickly and doesn't produce a very good model:
Initial ===== Cost (unregularised): 2.09203670985 /// Cost (regularised): 2.09203670985 Mean Gradient: 0.0321241229793
Iteration 100 Cost (unregularised): 0.980999805477 /// Cost (regularised): 0.980999805477 Mean Gradient: -5.29639499854e-09
TRAINED IN 26.45932364463806
This then gives really poor test accuracy and predicts the same output for every input; even when tested with all inputs set to 0.1 or all set to 0.9, I just get the same output (although precisely which digit it outputs varies depending on the initial random weights):
Test accuracy: 8.92
Targets 2 2 1 7 2 2 0 2 3
Hypothesis 5 5 5 5 5 5 5 5 5
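As a quick way to confirm this symptom, one hypothetical diagnostic (not part of my code) is to look at the distribution of predicted digits, assuming the nn instance and the scaled inputs created in the code dump below:
import numpy as np

preds = nn.predict(inputs)               # 'nn' and 'inputs' as created in the code dump below
print(np.bincount(preds, minlength=10))  # one large bin and nine zeros means every input maps to the same digit
A collapsed histogram like that matches the behaviour described above.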
And the curves for the MNIST Training:
Code dump:
# Import dependencies
import numpy as np
import time
import csv
import matplotlib.pyplot
import random
import math
# Read in training data
with open('MNIST/mnist_train_100.csv') as file:
train_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])
# In[197]:
# Plot a sample of training data to visualise
displayData(train_data[:,1:], 25)
# In[198]:
# Read in test data
with open('MNIST/mnist_test.csv') as file:
test_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])
# Main neural network class
class neuralNetwork:
# Define the architecture
def __init__(self, i, h, o, lr, lda):
# Number of nodes in each layer
self.i=i
self.h=h
self.o=o
# Learning rate
self.lr=lr
# Lambda for regularisation
self.lda=lda
# Randomly initialise the parameters, input-> hidden and hidden-> output
self.ih=np.random.normal(0.0,pow(self.h,-0.5),(self.h,self.i))
self.ho=np.random.normal(0.0,pow(self.o,-0.5),(self.o,self.h))
def predict(self, X):
# GET HYPOTHESIS ESTIMATES/ OUTPUTS
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node
z2=np.dot(X,self.ih.T) + 1
#print(a1.shape)
a2=sigmoid(z2)
#print(ha)
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node
z3=np.dot(a2,self.ho.T) + 1
h=sigmoid(z3)
outputs=np.argmax(h.T,axis=0)
return outputs
def backprop (self, X, y):
try:
m = X.shape[0]
except:
m=1
# GET HYPOTHESIS ESTIMATES/ OUTPUTS
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node
z2=np.dot(X,self.ih.T)
#print(a1.shape)
a2=sigmoid(z2)
#print(ha)
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node
z3=np.dot(a2,self.ho.T)
h=sigmoid(z3)
# Compute error/ cost for this setup (unregularised and regularise)
costReg=self.costFunc(h,y)
costUn=self.costFuncReg(h,y)
# Output error term
d3=-(y-h)*sigmoidGradient(z3)
# Hidden error term
d2=np.dot(d3,self.ho)*sigmoidGradient(z2)
# Partial derivatives for weights
D2=np.dot(d3.T,a2)
D1=np.dot(d2.T,X)
# Partial derivatives of theta with regularisation
T2Grad=(D2/m)+(self.lda/m)*(self.ho)
T1Grad=(D1/m)+(self.lda/m)*(self.ih)
# Update weights
# Hidden layer (weights 1)
self.ih-=self.lr*(((D1)/m) + (self.lda/m)*self.ih)
# Output layer (weights 2)
self.ho-=self.lr*(((D2)/m) + (self.lda/m)*self.ho)
# Unroll gradients to one long vector
grad=np.concatenate(((T1Grad).ravel(),(T2Grad).ravel()))
return costReg, costUn, grad
def backpropIter (self, X, y):
try:
m = X.shape[0]
except:
m=1
# GET HYPOTHESIS ESTIMATES/ OUTPUTS
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node
z2=np.dot(X,self.ih.T)
#print(a1.shape)
a2=sigmoid(z2)
#print(ha)
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node
z3=np.dot(a2,self.ho.T)
h=sigmoid(z3)
# Compute error/ cost for this setup (unregularised and regularise)
costUn=self.costFunc(h,y)
costReg=self.costFuncReg(h,y)
gradW1=np.zeros(self.ih.shape)
gradW2=np.zeros(self.ho.shape)
for i in range(m):
delta3 = -(y[i,:]-h[i,:])*sigmoidGradient(z3[i,:])
delta2 = np.dot(self.ho.T,delta3)*sigmoidGradient(z2[i,:])
gradW2= gradW2 + np.outer(delta3,a2[i,:])
gradW1 = gradW1 + np.outer(delta2,X[i,:])
# Update weights
# Hidden layer (weights 1)
#self.ih-=self.lr*(((gradW1)/m) + (self.lda/m)*self.ih)
# Output layer (weights 2)
#self.ho-=self.lr*(((gradW2)/m) + (self.lda/m)*self.ho)
# Unroll gradients to one long vector
grad=np.concatenate(((gradW1).ravel(),(gradW2).ravel()))
return costUn, costReg, grad
def gradDesc(self, X, y):
# Backpropagate to get updates
cost,costreg,grad=self.backpropIter(X,y)
# Unroll parameters
deltaW1=np.reshape(grad[0:self.h*self.i],(self.h,self.i))
deltaW2=np.reshape(grad[self.h*self.i:],(self.o,self.h))
# m = no. training examples
m=X.shape[0]
#print (self.ih)
self.ih -= self.lr * ((deltaW1))#/m) + (self.lda * self.ih))
self.ho -= self.lr * ((deltaW2))#/m) + (self.lda * self.ho))
#print(deltaW1)
#print(self.ih)
return cost,costreg,grad
# Gradient checking to compute the gradient numerically to debug backpropagation
def gradCheck(self, X, y):
# Unroll theta
theta=np.concatenate(((self.ih).ravel(),(self.ho).ravel()))
# perturb will add and subtract epsilon, numgrad will store answers
perturb=np.zeros(len(theta))
numgrad=np.zeros(len(theta))
# epsilon, e is a small number
e = 0.00001
# Loop over all theta
for i in range(len(theta)):
# Perturb is zeros with one index being e
perturb[i]=e
loss1=self.costFuncGradientCheck(theta-perturb, X, y)
loss2=self.costFuncGradientCheck(theta+perturb, X, y)
# Compute numerical gradient and update vectors
numgrad[i]=(loss1-loss2)/(2*e)
perturb[i]=0
return numgrad
def costFuncGradientCheck(self,theta,X,y):
T1=np.reshape(theta[0:self.h*self.i],(self.h,self.i))
T2=np.reshape(theta[self.h*self.i:],(self.o,self.h))
m=X.shape[0]
# GET HYPOTHESIS ESTIMATES/ OUTPUTS
# Compute activation to hidden node
z2=np.dot(X,T1.T)
a2=sigmoid(z2)
# Compute activation to output node
z3=np.dot(a2,T2.T)
h=sigmoid(z3)
cost=self.costFunc(h, y)
return cost #+ ((self.lda/2)*(np.sum(pow(T1,2)) + np.sum(pow(T2,2))))
def costFunc(self, h, y):
m=h.shape[0]
return np.sum(pow((h-y),2))/m
def costFuncReg(self, h, y):
cost=self.costFunc(h, y)
return cost #+ ((self.lda/2)*(np.sum(pow(self.ih,2)) + np.sum(pow(self.ho,2))))
# Helper functions to compute sigmoid and gradient for an input number or matrix
def sigmoid(Z):
return np.divide(1,np.add(1,np.exp(-Z)))
def sigmoidGradient(Z):
return sigmoid(Z)*(1-sigmoid(Z))
# Pre=processing helper functions
# Normalise data to 0.1-1 as 0 inputs kills the weights and changes
def scaleDataVec(data):
return (np.asfarray(data[1:]) / 255.0 * 0.99) + 0.1
def scaleData(data):
return (np.asfarray(data[:,1:]) / 255.0 * 0.99) + 0.1
# DISPLAY DATA
# plot_data will be what to plot, num_ex must be a square number of how many examples to plot, random examples will then be plotted
def displayData(plot_data, num_ex, rand=1):
if rand==0:
data=plot_data
else:
rand_indexes=random.sample(range(plot_data.shape[0]),num_ex)
data=plot_data[rand_indexes,:]
# Useful variables, m= no. train ex, n= no. features
m=data.shape[0]
n=data.shape[1]
# Shape for one example
example_width=math.ceil(math.sqrt(n))
example_height=math.ceil(n/example_width)
# No. of items to display
display_rows=math.floor(math.sqrt(m))
display_cols=math.ceil(m/display_rows)
# Padding between images
pad=1
# Setup blank display
display_array = -np.ones((pad + display_rows * (example_height + pad), (pad + display_cols * (example_width + pad))))
curr_ex=0
for i in range(1,display_rows+1):
for j in range(1,display_cols+1):
if curr_ex>m:
break
# Max value of this patch
max_val=max(abs(data[curr_ex, :]))
display_array[pad + (j-1) * (example_height + pad) : j*(example_height+1), pad + (i-1) * (example_width + pad) : i*(example_width+1)] = data[curr_ex, :].reshape(example_height, example_width)/max_val
curr_ex+=1
matplotlib.pyplot.imshow(display_array, cmap='Greys', interpolation='None')
# In[312]:
a=neuralNetwork(2,5,2,0.5,0.0)
print(a.backpropIter(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])))
print(a.gradCheck(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])))
D=[]
C=[]
for i in range(100):
c,b,d=a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))
C.append(c)
D.append(np.mean(d))
#print(c)
print(a.predict(np.array([[0.1,0.9]])))
# Debugging plot
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(C)
matplotlib.pyplot.ylabel("Error")
matplotlib.pyplot.xlabel("Iterations")
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(D)
matplotlib.pyplot.ylabel("Gradient")
matplotlib.pyplot.xlabel("Iterations")
#print(J)
# In[313]:
# Class instance
# Number of input, hidden and output nodes
# Input = 28 x 28 pixels
input_nodes=784
# Arbitrary number of hidden nodes, experiment to improve
hidden_nodes=200
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
output_nodes=10
# Learning rate
learning_rate=0.4
# Regularisation parameter
lambd=0.0
# Create instance of Nnet class
nn=neuralNetwork(input_nodes,hidden_nodes,output_nodes,learning_rate,lambd)
# In[314]:
time1=time.time()
# Scale inputs
inputs=scaleData(train_data)
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
targets=(np.identity(output_nodes)*0.98)[train_data[:,0],:]+0.01
J=[]
JR=[]
Grad=[]
iterations=100
for i in range(iterations):
j,jr,grad=nn.gradDesc(inputs, targets)
grad=np.mean(grad)
if i == 0:
print("Initial ===== Cost (unregularised): ", j, "\t///", "Cost (regularised): ",jr," Mean Gradient: ",grad)
print("\r", end="")
print("Iteration ", i+1, "\tCost (unregularised): ", j, "\t///", "Cost (regularised): ", jr," Mean Gradient: ",grad,end="")
J.append(j)
JR.append(jr)
Grad.append(grad)
time2 = time.time()
print ("\nTRAINED IN ",time2-time1)
# In[315]:
# Debugging plot
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(J)
matplotlib.pyplot.plot(JR)
matplotlib.pyplot.ylabel("Error")
matplotlib.pyplot.xlabel("Iterations")
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(Grad)
matplotlib.pyplot.ylabel("Gradient")
matplotlib.pyplot.xlabel("Iterations")
#print(J)
# In[316]:
# Scale inputs
inputs=scaleData(test_data)
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
targets=test_data[:,0]
h=nn.predict(inputs)
score=[]
targ=[]
hyp=[]
for i,line in enumerate(targets):
if line == h[i]:
score.append(1)
else:
score.append(0)
hyp.append(h[i])
targ.append(line)
print("Test accuracy: ", sum(score)/len(score)*100)
indexes=random.sample(range(len(hyp)),9)
print("Targets ",end="")
for j in indexes:
print (targ[j]," ",end="")
print("\nHypothesis ",end="")
for j in indexes:
print (hyp[j]," ",end="")
displayData(test_data[indexes, 1:], 9, rand=0)
# In[277]:
nn.predict(0.9*np.ones((784,)))
Edit 1
It was suggested to use different learning rates, but unfortunately they all come out with similar results; here are the plots for 30 iterations, using the MNIST 100 subset:
Concretely, here are the figures that they start and end with:
Initial ===== Cost (unregularised): 4.07208963507 /// Cost (regularised): 4.07208963507 Mean Gradient: 0.0540251381858
Iteration 50 Cost (unregularised): 0.613310215166 /// Cost (regularised): 0.613310215166 Mean Gradient: -0.000133981500849
Initial ===== Cost (unregularised): 5.67535252616 /// Cost (regularised): 5.67535252616 Mean Gradient: 0.0644797515914
Iteration 50 Cost (unregularised): 0.381080434935 /// Cost (regularised): 0.381080434935 Mean Gradient: 0.000427866902699
Initial ===== Cost (unregularised): 3.54658422176 /// Cost (regularised): 3.54658422176 Mean Gradient: 0.0672211732868
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: 2.34515341943e-20
Initial ===== Cost (unregularised): 4.05269658215 /// Cost (regularised): 4.05269658215 Mean Gradient: 0.0469666696193
Iteration 50 Cost (unregularised): 0.980999999999 /// Cost (regularised): 0.980999999999 Mean Gradient: -1.0582706063e-14
Initial ===== Cost (unregularised): 2.40881492228 /// Cost (regularised): 2.40881492228 Mean Gradient: 0.0516056901574
Iteration 50 Cost (unregularised): 1.74539997258 /// Cost (regularised): 1.74539997258 Mean Gradient: 1.01955789614e-09
Initial ===== Cost (unregularised): 2.58498876008 /// Cost (regularised): 2.58498876008 Mean Gradient: 0.0388768685257
Iteration 3 Cost (unregularised): 1.72520399313 /// Cost (regularised): 1.72520399313 Mean Gradient: 0.0134040908157
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -4.49319474346e-43
Initial ===== Cost (unregularised): 4.40141352357 /// Cost (regularised): 4.40141352357 Mean Gradient: 0.0689167742968
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -1.01563966458e-22
A learning rate of 0.01, quite low, has the best outcome, but exploring learning rates in this region I only came out with 30-40% accuracy: a big improvement on the 8% or even 0% that I had seen previously, but not really what it should be achieving!
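For reference, a minimal sketch of how such a learning-rate sweep can be run, reusing the neuralNetwork class and the scaled inputs/targets arrays from the code dump above (the list of rates is illustrative):
# Sweep a few learning rates, re-creating the network each time so the runs are independent
for lr in [0.01, 0.05, 0.1, 0.2, 0.4, 0.8]:
    net = neuralNetwork(784, 200, 10, lr, 0.0)
    costs = []
    for i in range(50):
        j, jr, grad = net.gradDesc(inputs, targets)
        costs.append(j)
    print("learning rate:", lr, "final cost:", costs[-1])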
Edit 2
I have now finished and added a backpropagation function optimised for matrices rather than the iterative formula, so I can run it for large numbers of epochs/iterations without it being painfully slow. The "backprop" function of the class matches the gradient check (in fact it is 1/2 the size, but I think that is a problem in the gradient check, so we'll leave that because it should not matter proportionally, and I have tried adding in divisions to solve it). With large numbers of epochs I achieved a much better accuracy, but there still seems to be a problem, as when I previously programmed a slightly different style of simple 3-layer neural network as part of a book, on the same dataset CSVs, I got a much better training result. Here are some plots and data for large epochs.
Looks good, but we still have a pretty poor test-set accuracy, and this is for 2,500 runs through the dataset; it should be getting a good result with far fewer!
Test accuracy: 61.150000000000006
Targets 6 9 8 2 2 2 4 3 8
Hypothesis 6 9 8 4 7 1 4 3 8
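On the factor-of-two mentioned above: one possible explanation is the cost itself. costFunc is sum((h - y)**2)/m, whose exact derivative with respect to h is 2*(h - y)/m, while the delta terms in backprop are built from (h - y) alone, which would account for an analytic gradient half the size of the numerical one. A tiny self-contained check of that derivative (illustrative only, not the network code):
import numpy as np

h = np.array([0.3, 0.7])
y = np.array([0.0, 1.0])
m, eps = 1, 1e-5

def cost(hh):
    return np.sum((hh - y) ** 2) / m

# Central-difference estimate of d(cost)/dh for each component
numeric = np.array([(cost(h + eps * e) - cost(h - eps * e)) / (2 * eps) for e in np.eye(2)])
print(numeric)        # approximately 2*(h - y)/m = [ 0.6, -0.6]
print((h - y) / m)    # the term the deltas use: [ 0.3, -0.3], i.e. half the numerical value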
Edit 3, what dataset?
http://makeyourownneuralnetwork.blogspot.co.uk/2015/03/the-mnist-dataset-of-handwitten-digits.html?m=1
I used train.csv and test.csv to try with more data; it was no better and just takes longer, so I have been using the subsets train_100 and test_10 while I debug.
Edit 4
It seems to learn something after a very large number of epochs (around 14,000). Since the whole dataset is used in the backprop function (not backpropIter), each loop is effectively an epoch, and with a ridiculous number of epochs on the subset of 100 training and 10 test samples the test accuracy is quite good. However, with a sample this small that could easily be down to chance, and even then it is only 70%, not what you would be aiming for even on the small dataset. But it does show that it seems to be learning; I am trying parameters very extensively to rule that out.
Solved
I solved the problem with my neural network. A brief description follows in case it helps anyone else. Thanks to all those who helped with suggestions.
Basically, I had implemented it with a fully matrix-based approach, i.e. the backpropagation uses all examples each time. I later tried implementing it as a vector approach, i.e. backpropagating with each example individually. That was when I realised that the matrix approach doesn't update the parameters after each example, so one run through this way is NOT the same as one run through each example in turn; effectively the whole training set is backpropagated as one example. Hence my matrix implementation does work, but only after many iterations, which then ends up taking longer than the vector approach anyway! I have opened a new question to learn more about this specific part, but there we go; it just needed a lot of iterations with the matrix approach, or a more gradual example-by-example approach.
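To make the distinction concrete, here is a rough sketch of the two training loops, reusing the neuralNetwork instance nn and the scaled inputs/targets from the code dump above (illustrative only):
import numpy as np

# Full-batch: the entire training set contributes to one parameter update per loop
for epoch in range(100):
    nn.gradDesc(inputs, targets)

# Per-example (stochastic) style: the parameters are updated after every single example,
# so one pass over the data performs len(inputs) updates instead of one
for epoch in range(100):
    order = np.random.permutation(len(inputs))   # shuffle each epoch
    for i in order:
        nn.gradDesc(inputs[i:i+1, :], targets[i:i+1, :])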
You can "debug" your neural network using Tensorleap. It's a neural network debugging platform which uses some explainability algorithms. It allows you to upload your model and dataset and get a lot of information about your trained model.
MNIST is already in as a demo project.
They have a 14-day free trial, as far as I know.
Related
Gradient descent using TensorFlow is much slower than a basic Python implementation, why?
I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a and b such that Y = a*X + b approximates an (x, y) point cloud (which I generated myself for the sake of simplicity). I am solving this LR problem using a 'fixed step size gradient descent (FSSGD)'. I implemented it using TensorFlow and it works but I noticed that it is really slow both on GPU and CPU. Because I was curious I implemented the FSSGD myself in Python/NumPy and as expected this runs much faster, about: 10x faster than TF#CPU 20x faster than TF#GPU If TensorFlow is this slow, I cannot imagine that so many people are using this framework. So I must be doing something wrong. Can anyone help me so I can speedup my TensorFlow implementation. I'm NOT interested in the difference between the CPU and GPU performance. Both performance indicators are merely provided for completeness and illustration. I'm interested in why my TensorFlow implementation is so much slower than a raw Python/NumPy implementation. As reference, I add my code below. Stripped to a minimal (but fully working) example. Using Python v3.7.9 x64. Used tensorflow-gpu==1.15 for now (because the course uses TensorFlow v1) Tested to run in both Spyder and PyCharm. My FSSGD implementation using TensorFlow (execution time about 40 sec #CPU to 80 sec #GPU): #%% General imports import numpy as np import timeit import tensorflow.compat.v1 as tf #%% Get input data # Generate simulated input data x_data_input = np.arange(100, step=0.1) y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15 #%% Define tensorflow model # Define data size n_samples = x_data_input.shape[0] # Tensorflow is finicky about shapes, so resize x_data = np.reshape(x_data_input, (n_samples, 1)) y_data = np.reshape(y_data_input, (n_samples, 1)) # Define placeholders for input X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data") Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data") # Define variables to be learned with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE W = tf.get_variable("weights", (1, 1), initializer=tf.constant_initializer(0.0)) b = tf.get_variable("bias", (1,), initializer=tf.constant_initializer(0.0)) # Define loss function Y_pred = tf.matmul(X, W) + b loss = tf.reduce_sum((Y - Y_pred) ** 2 / n_samples) # Quadratic loss function # %% Solve tensorflow model #Define algorithm parameters total_iterations = 1e5 # Defines total training iterations #Construct TensorFlow optimizer with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE opt = tf.train.GradientDescentOptimizer(learning_rate = 1e-4) opt_operation = opt.minimize(loss, name="GDO") #To measure execution time time_start = timeit.default_timer() with tf.Session() as sess: #Initialize variables sess.run(tf.global_variables_initializer()) #Train variables for index in range(int(total_iterations)): _, loss_val_tmp = sess.run([opt_operation, loss], feed_dict={X: x_data, Y: y_data}) #Get final values of variables W_val, b_val, loss_val = sess.run([W, b, loss], feed_dict={X: x_data, Y: y_data}) #Print execution time time_end = timeit.default_timer() print('') print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start)) print('') # %% Print results print('') print('Iteration = {0:0.3f}'.format(total_iterations)) print('W_val = {0:0.3f}'.format(W_val[0,0])) print('b_val = 
{0:0.3f}'.format(b_val[0])) print('') My own python FSSGD implementation (execution time about 4 sec): #%% General imports import numpy as np import timeit #%% Get input data # Define input data x_data_input = np.arange(100, step=0.1) y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15 #%% Define Gradient Descent (GD) model # Define data size n_samples = x_data_input.shape[0] #Initialize data W = 0.0 # Initial condition b = 0.0 # Initial condition # Compute initial loss y_gd_approx = W*x_data_input+b loss = np.sum((y_data_input - y_gd_approx)**2)/n_samples # Quadratic loss function #%% Execute Gradient Descent algorithm #Define algorithm parameters total_iterations = 1e5 # Defines total training iterations GD_stepsize = 1e-4 # Gradient Descent fixed step size #To measure execution time time_start = timeit.default_timer() for index in range(int(total_iterations)): #Compute gradient (derived manually for the quadratic cost function) loss_gradient_W = 2.0/n_samples*np.sum(-x_data_input*(y_data_input - y_gd_approx)) loss_gradient_b = 2.0/n_samples*np.sum(-1*(y_data_input - y_gd_approx)) #Update trainable variables using fixed step size gradient descent W = W - GD_stepsize * loss_gradient_W b = b - GD_stepsize * loss_gradient_b #Compute loss y_gd_approx = W*x_data_input+b loss = np.sum((y_data_input - y_gd_approx)**2)/x_data_input.shape[0] #Print execution time time_end = timeit.default_timer() print('') print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start)) print('') # %% Print results print('') print('Iteration = {0:0.3f}'.format(total_iterations)) print('W_val = {0:0.3f}'.format(W)) print('b_val = {0:0.3f}'.format(b)) print('')
I think it's the result of the large iteration count. I changed the iteration number from 1e5 to 1e3 and also changed x from x_data_input = np.arange(100, step=0.1) to x_data_input = np.arange(100, step=0.0001). This way I reduced the iteration count but increased the computation by 10x. With NumPy it's done in 22 sec and in TensorFlow it's done in 25 sec. My guess: TensorFlow has a lot of overhead in each iteration (to give us a framework that can do a lot), but the forward-pass and backward-pass speeds are OK.
The actual answer to my question is hidden in the various comments. For future readers, I will summarise those findings in this answer.
About the speed difference between TensorFlow and a raw Python/NumPy implementation: this part of the answer is actually quite logical. Each iteration (= each call of Session.run()) TensorFlow performs computations, and it has a large overhead for starting each computation. On GPU this overhead is even worse than on CPU. However, TensorFlow executes the actual computations very efficiently, more efficiently than the raw Python/NumPy implementation above does. So, when the number of data points is increased, and therefore the number of computations per iteration, you will see that the relative performance between TensorFlow and Python/NumPy shifts in TensorFlow's favour. The opposite is also true. The problem described in the question is very small, meaning that the number of computations is very low while the number of iterations is very large. That is why TensorFlow performs so badly. This type of small problem is not the typical use case TensorFlow was designed for.
To reduce the execution time: the execution time of the TensorFlow script can still be reduced a lot! The number of iterations must be reduced (no matter the size of the problem, this is a good aim anyway). As @amin pointed out, this is achieved by scaling the input data. A very brief explanation of why this works: the sizes of the gradient and of the variable updates are better balanced relative to the absolute values being searched for, so fewer steps (= iterations) are required. Following @amin's advice, I finally ended up scaling my x-data as follows (some code is repeated to make the position of the new code clear):
# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))
### START NEW CODE ###
# Scale x_data
x_mean = np.mean(x_data)
x_std = np.std(x_data)
x_data = (x_data - x_mean) / x_std
### END NEW CODE ###
# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")
Scaling speeds up the convergence by a factor of 1000. Instead of 1e5 iterations, 1e2 iterations are needed. This is partially because a maximum step size of 1e-1 can be used instead of a step size of 1e-4. Please note that the weight and bias found are different and that you must feed scaled data from now on. Optionally, you can choose to unscale the found weight and bias so you can feed unscaled data. Unscaling is done using this code (put somewhere at the end of the script):
#%% Unscaling
W_val_unscaled = W_val[0,0]/x_std
b_val_unscaled = b_val[0]-x_mean*W_val[0,0]/x_std
pytorch: sum of cross entropy over all classes
I want to compute the sum of cross entropy over all classes for each prediction, where the input is a batch (size n) and the output is a batch (size n). The simplest way is a for loop (for 1000 classes):
def sum_of_CE_lost(input):
    L = 0
    for c in range(1000):
        L = L + torch.nn.CrossEntropyLoss(input, c)
    return L
However, it is very slow. What is a better way? How can we parallelise it for the GPU (CUDA)?
First of all, to make it faster you need to vectorise it, that is, work with matrices. So, imagine you have 1,000 samples to compute the loss on, and your classification problem has 5 labels. To compute the CrossEntropyLoss we need an input and a target. Let's simulate that as follows:
loss = nn.CrossEntropyLoss()  # the loss function
input = torch.randn(1000, 5)  # 1000 samples and 5 labels' predictions
target = torch.empty(1000, dtype=torch.long).random_(5)  # 1000 samples with labels from 0 to 4
loss_value = loss(input, target)  # it'll output the loss
There we go! Now the loss is computed considering the 1,000 samples. This is the fastest way to do it.
I found the answer:
torch.nn.functional.log_softmax(input).sum() / input.shape[0]
We divide by input.shape[0] because cross_entropy() takes, by default, the mean across the batch dimension.
Improving a simple 1 layer Neural Network
I've created my own very simple 1 layer neural network, specialised in binary classification problems. Where the input data-points are multiplied by the weights and a bias is added. The whole thing is summed (weighted-sum) and fed through an activation function (such as relu or sigmoid). That would be the prediction output. There are no other layers (i.e. hidden layers) involved. Just for my own understanding of the mathematical side, I didn't want to use an existing library/package (e.g. Keras, PyTorch, Scikit-learn ..etc), but simply wanted to create a neural network using plain python code. The model is created inside a method (simple_1_layer_classification_NN) that takes the necessary parameters to make a prediction. However, I encountered some problems, and as such listed the questions below along with my code. P.s. I really apologise for including such a large portion of code, but I didn't know how else to ask the questions without referencing the relevant code. The questions: 1 - When I passed some training dataset to train the network, I found that the final average accuracy completely differed with different number of Epochs with absolutely no clear pattern to some sort of optimal number of Epochs. I kept the other parameters the same: learning rate = 0.5, activation = sigmoid(since it's 1 layer - being both the input and output layer. No hidden layers involved. I've read sigmoid is suited for output layer more than relu), cost function = squared error. Here are the results for different Epochs: Epoch = 100,000. Average Accuracy: 50.10541638874056 Epoch = 500,000. Average Accuracy: 50.08965597645948 Epoch = 1,000,000. Average Accuracy: 97.56879179064482 Epoch = 7,500,000. Average Accuracy: 49.994692515332524 Epoch 750,000. Average Accuracy: 77.0028368954157 Epoch = 100. Average Accuracy: 48.96967591507596 Epoch = 500. Average Accuracy: 48.20721972881673 Epoch = 10,000. Average Accuracy: 71.58066454336122 Epoch = 50,000. Average Accuracy: 62.52998222597177 Epoch = 100,000. Average Accuracy: 49.813675726563424 Epoch = 1,000,000. Average Accuracy: 49.993141329926374 As you can see there doesn't seem to be any clear pattern. I tried 1 million epochs and got 97.6% accuracy. Then I tried 7.5 million epochs got 50% accuracy. Half a million epochs also got 50% accuracy. 100 epochs resulted in 49% accuracy. Then the really odd one, tried 1 millions epochs again and got 50%. So I'm sharing my code below, because I don't believe the network is doing any learning. Just seems like random guesses. I applied the concept of Back-propagation and partial derivative to optimise the weights and bias. So I'm not sure where I'm going wrong with my code. 2- One of the parameters I included in the parameter list of the simple_1_layer_classification_NN method, is the input_dimension parameter. At first I thought it would be needed to workout the number of weights required for the input layer. Then I realised, as long as the dataset_input_matrix (matrix of features) argument is passed to the method, I can access a random index of the matrix to access a random observation vector from the matrix (input_observation_vector = dataset_input_matrix[ri]). Then looping through the observation to access each feature. The number of loops (or length) of the observation vector will tell me exactly how many weights are required (because each feature will require one weight (as its coefficient). 
So (len(input_observation_vector)) will tell me the number of weights required in the input layer, and therefore I don't need to ask the user to pass input_dimension argument to the method. So my question is simply, is there any need/reason to include a input_dimension parameter, when this can be worked out simply by evaluating the length of the observation vector from the input matrix? 3 - When I try to plot the array of costs values, nothing shows up - plt.plot(y_costs). A cost value (produced from every Epoch), is appended to the costs array only every 50 epochs. This is to avoid having so many cost elements added in the array if the number of epochs is really high. At line: if i % 50 == 0: costs.append(cost) When I did some debugging, I found that the costs array is empty, after the method returns. I'm not sure why that is, when it should be appending a cost value every 50th epoch. Probably I've overlooked something really silly that I can't see it. Many thanks in advance, and apologies again for the long piece of code. from __future__ import print_function import numpy as np import matplotlib.pyplot as plt import sys # import os class NN_classification: def __init__(self): self.bias = float() self.weights = [] self.chosen_activation_func = None self.chosen_cost_func = None self.train_average_accuracy = int() self.test_average_accuracy = int() # -- Activation functions --: def sigmoid(x): return 1/(1 + np.exp(-x)) def relu(x): return np.maximum(0.0, x) # -- Derivative of activation functions --: def sigmoid_derivation(x): return NN_classification.sigmoid(x) * (1-NN_classification.sigmoid(x)) def relu_derivation(x): if x <= 0: return 0 else: return 1 # -- Squared-error cost function --: def squared_error(pred, target): return np.square(pred - target) # -- Derivative of squared-error cost function --: def squared_error_derivation(pred, target): return 2 * (pred - target) # --- neural network structure diagram --- # O output prediction # / \ w1, w2, b # O O datapoint 1, datapoint 2 def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'): weights = [] bias = int() cost = float() costs = [] dCost_dWeights = [] chosen_activation_func_derivation = None chosen_cost_func = None chosen_cost_func_derivation = None correct_pred = int() incorrect_pred = int() # store the chosen activation function to use to it later on in the activation calculation section and in the 'predict' method # Also the same goes for the derivation section. if activation_func == 'sigmoid': self.chosen_activation_func = NN_classification.sigmoid chosen_activation_func_derivation = NN_classification.sigmoid_derivation elif activation_func == 'relu': self.chosen_activation_func = NN_classification.relu chosen_activation_func_derivation = NN_classification.relu_derivation else: print("Exception error - no activation function utilised, in training method", file=sys.stderr) return # store the chosen cost function to use to it later on in the cost calculation section. # Also the same goes for the cost derivation section. 
if cost_func == 'squared_error': chosen_cost_func = NN_classification.squared_error chosen_cost_func_derivation = NN_classification.squared_error_derivation else: print("Exception error - no cost function utilised, in training method", file=sys.stderr) return # Set initial network parameters (weights & bias): # Will initialise the weights to a uniform distribution and ensure the numbers are small close to 0. # We need to loop through all the weights to set them to a random value initially. for i in range(input_dimension): # create random numbers for our initial weights (connections) to begin with. 'rand' method creates small random numbers. w = np.random.rand() weights.append(w) # create a random number for our initial bias to begin with. bias = np.random.rand() # We perform the training based on the number of epochs specified for i in range(epochs): # create random index ri = np.random.randint(len(dataset_input_matrix)) # Pick random observation vector: pick a random observation vector of independent variables (x) from the dataset matrix input_observation_vector = dataset_input_matrix[ri] # reset weighted sum value at the beginning of every epoch to avoid incrementing the previous observations weighted-sums on top. weighted_sum = 0 # Loop through all the independent variables (x) in the observation for i in range(len(input_observation_vector)): # Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum weighted_sum += input_observation_vector[i] * weights[i] # Add Bias: add bias to weighted sum weighted_sum += bias # Activation: process weighted_sum through activation function activation_func_output = self.chosen_activation_func(weighted_sum) # Prediction: Because this is a single layer neural network, so the activation output will be the same as the prediction pred = activation_func_output # Cost: the cost function to calculate the prediction error margin cost = chosen_cost_func(pred, output_data_labels[ri]) # Also calculate the derivative of the cost function with respect to prediction dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri]) # Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum. dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum) # Bias is just a number on its own added to the weighted sum, so its derivative is just 1 dWeightSum_dB = 1 # The derivative of the Weighted Sum with respect to each weight is the input data point / independant variable it's multiplied by. # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights' # to represent the array of the derivative of all the weights involved. I could've used the 'input_sample' # array variable itself, but for the sake of readibility, I created a separate variable to represent the derivative of each of the weights. dWeightedSum_dWeights = input_observation_vector # Derivative chaining rule: chaining all the derivative functions together (chaining rule) # Loop through all the weights to workout the derivative of the cost with respect to each weight: for dWeightedSum_dWeight in dWeightedSum_dWeights: dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight dCost_dWeights.append(dCost_dWeight) dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB # Backpropagation: update the weights and bias according to the derivatives calculated above. 
# In other word we update the parameters of the neural network to correct parameters and therefore # optimise the neural network prediction to be as accurate to the real output as possible # We loop through each weight and update it with its derivative with respect to the cost error function value. for i in range(len(weights)): weights[i] = weights[i] - learning_rate * dCost_dWeights[i] bias = bias - learning_rate * dCost_dB # for each 50th loop we're going to get a summary of the # prediction compared to the actual ouput # to see if the prediction is as expected. # Anything in prediction above 0.5 should match value # 1 of the actual ouptut. Any prediction below 0.5 should # match value of 0 for actual output if i % 50 == 0: costs.append(cost) # Compare prediction to target error_margin = np.sqrt(np.square(pred - output_data_labels[ri])) accuracy = (1 - error_margin) * 100 self.train_average_accuracy += accuracy # Evaluate whether guessed correctly or not based on classification binary problem 0 or 1 outcome. So if prediction is above 0.5 it guessed 1 and below 0.5 it guessed incorrectly. If it's dead on 0.5 it is incorrect for either guesses. Because it's no exactly a good guess for either 0 or 1. We need to set a good standard for the neural net model. if (error_margin < 0.5) and (error_margin >= 0): correct_pred += 1 elif (error_margin >= 0.5) and (error_margin <= 1): incorrect_pred += 1 else: print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr) return # store the final optimised weights to the weights instance variable so it can be used in the predict method. self.weights = weights # store the final optimised bias to the weights instance variable so it can be used in the predict method. self.bias = bias # Calculate average accuracy from the predictions of all obervations in the training dataset self.train_average_accuracy /= epochs # Print out results print('Average Accuracy: {}'.format(self.train_average_accuracy)) print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred)) print('costs = {}'.format(costs)) y_costs = np.array(costs) plt.plot(y_costs) plt.show() from numpy import array #define array of dataset # each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0, 1 to represent blue flower and red flower respectively). data = array([[3, 1.5, 1], [2, 1, 0], [4, 1.5, 1], [3, 1, 0], [3.5, 0.5, 1], [2, 0.5, 0], [5.5, 1, 1], [1, 1, 0]]) # separate data: split input, output, train and test data. X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1] nn_model = NN_classification() nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 1000000, learning_rate=0.5)
Have you tried a smaller learning rate? Your network may be skipping over local minima because it is too high. Here's an article that goes more in-depth on learning rates: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10 The reason that the cost is never getting appended is because you are using the same variable, 'i', within nested for loops. # We perform the training based on the number of epochs specified for i in range(epochs): # create random index ri = np.random.randint(len(dataset_input_matrix)) # Pick random observation vector: pick a random observation vector of independent variables (x) from the dataset matrix input_observation_vector = dataset_input_matrix[ri] # reset weighted sum value at the beginning of every epoch to avoid incrementing the previous observations weighted-sums on top. weighted_sum = 0 # Loop through all the independent variables (x) in the observation for i in range(len(input_observation_vector)): # Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum weighted_sum += input_observation_vector[i] * weights[i] # Add Bias: add bias to weighted sum weighted_sum += bias # Activation: process weighted_sum through activation function activation_func_output = self.chosen_activation_func(weighted_sum) # Prediction: Because this is a single layer neural network, so the activation output will be the same as the prediction pred = activation_func_output # Cost: the cost function to calculate the prediction error margin cost = chosen_cost_func(pred, output_data_labels[ri]) # Also calculate the derivative of the cost function with respect to prediction dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri]) # Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum. dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum) # Bias is just a number on its own added to the weighted sum, so its derivative is just 1 dWeightSum_dB = 1 # The derivative of the Weighted Sum with respect to each weight is the input data point / independant variable it's multiplied by. # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights' # to represent the array of the derivative of all the weights involved. I could've used the 'input_sample' # array variable itself, but for the sake of readibility, I created a separate variable to represent the derivative of each of the weights. dWeightedSum_dWeights = input_observation_vector # Derivative chaining rule: chaining all the derivative functions together (chaining rule) # Loop through all the weights to workout the derivative of the cost with respect to each weight: for dWeightedSum_dWeight in dWeightedSum_dWeights: dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight dCost_dWeights.append(dCost_dWeight) dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB # Backpropagation: update the weights and bias according to the derivatives calculated above. # In other word we update the parameters of the neural network to correct parameters and therefore # optimise the neural network prediction to be as accurate to the real output as possible # We loop through each weight and update it with its derivative with respect to the cost error function value. 
for i in range(len(weights)): weights[i] = weights[i] - learning_rate * dCost_dWeights[i] bias = bias - learning_rate * dCost_dB # for each 50th loop we're going to get a summary of the # prediction compared to the actual ouput # to see if the prediction is as expected. # Anything in prediction above 0.5 should match value # 1 of the actual ouptut. Any prediction below 0.5 should # match value of 0 for actual output This was causing 'i' to always be 1 when it got to the if statement if i % 50 == 0: costs.append(cost) # Compare prediction to target error_margin = np.sqrt(np.square(pred - output_data_labels[ri])) accuracy = (1 - error_margin) * 100 self.train_average_accuracy += accuracy Edit So I tried training the model 1000 times with random learning rates between 0 and 1, and the initial learning rate doesn't seem to make any difference. 0.3% of these achieved accuracies above 0.60, and none of them were above 70%. Then I ran the same test with an adaptive learning rate: # Modify the learning rate based on the cost # Placed just before the bias is calculated learning_rate = 0.999 * learning_rate + 0.1 * cost This is resulting in about 10-12% of the models having an accuracy above 60%, and about 2.5% of them are above 70%
Optical Character Recognition using Neural Networks in Python
This code is for OCR using ANN ,it contains one hidden layer, the input is an image of size 28x28.the code runs without any error but the output is not at all accurate even after giving 5000+ images for training.I am using the mnist dataset which is of the form of jpg images. Please tell me what is wrong with my logic. import numpy as np from PIL import Image import random from random import randint y = [[0,0,0,0,0,0,0,0,0,0]] W1 = [[ random.uniform(-1, 1) for q in range(40)] for p in range(784)] W2 = [[ random.uniform(-1, 1) for q in range(10)] for p in range(40)] def sigmoid(x): global b return (1.0 / (1.0 + np.exp(-x))) #run the neural net forward def run(X, W): return sigmoid(np.matmul(X,W)) #1x2 * 2x2 = 1x1 matrix #cost function def cost(X, y, W): nn_output = run(X, W) return ((nn_output - y)) def gradient_Descent(X,y,W1,W2): alpha = 0.12 #learning rate epochs = 15000 #num iterations for i in range(epochs): Z2=sigmoid(np.matmul(run(X,W1),W2)) #final activation function(1X10)) Z1=run(X,W1) #first activation function(1X40) phi1=Z1*(1-Z1) #differentiation of Z1 phi2=Z2*(1-Z2) #differentiation of Z2 delta2 = phi2*cost(Z1,y,W2) #delta for outer layer(1X10) delta1 = np.transpose(np.transpose(phi1)*np.matmul(W2,np.transpose(delta2))) deltaW2 = alpha*(np.matmul(np.transpose(Z1),delta2)) deltaW1 = alpha*(np.matmul(np.transpose(X),delta1)) W1=W1+deltaW1 W2=W2+deltaW2 def Training(): for j in range(8): y[0][j]=1 k=1 while k<=15: #5421 print(k) q=0 img = Image.open('mnist_jpgfiles/train/mnist_'+str(j)+'_'+str(k)+'.jpg') iar = np.array(img) #image array ar=np.reshape(iar,(1,np.product(iar.shape))) ar=np.array(ar,dtype=float) X = ar ''' for p in range(784): if X[0][p]>0: X[0][p]=1 else: X[0][p]=0 ''' k+=1 gradient_Descent(X,y,W1,W2) print(np.argmin(cost(run(X,W1),y,W2))) #print(W1) y[0][j]=0 Training() def test(): global W1,W2 for j in range(3): k=1 while k<=5: #890 img = Image.open('mnist_jpgfiles/test/mnist_'+str(j)+'_'+str(k)+'.jpg') iar = np.array(img) #image array ar=np.reshape(iar,(1,np.product(iar.shape))) ar=np.array(ar,dtype=float) X = ar/256 ''' for p in range(784): if X[0][p]>0: X[0][p]=1 else: X[0][p]=0 ''' k+=1 print("Should be "+str(j)) print((run(run(X,W1),W2))) print((np.argmax(run(run(X,W1),W2)))) print("Testing.....") test()
There is a problem with your cost function: you simply calculate the difference between the hypothesis output and the actual output. That makes your cost function linear, so it is strictly increasing (or strictly decreasing), which can't be optimised. You need to use a cross-entropy cost function (because you use sigmoid as the activation function). Also, gradient descent on its own can't optimise the ANN cost function; you should use back-propagation with gradient descent to optimise it.
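For reference, a minimal sketch of a cross-entropy cost for sigmoid outputs (an illustration of the suggestion above, not the poster's code; h holds network outputs in (0, 1) and y the one-hot targets, both of shape (m, n_out)):
import numpy as np

def cross_entropy_cost(h, y, eps=1e-12):
    # Binary cross-entropy summed over outputs and averaged over the m examples
    m = h.shape[0]
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
A nice side effect with sigmoid outputs is that the output-layer delta simplifies to (h - y), without the extra sigmoid-gradient factor.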
I haven't worked with ANNs, but when working with the gradient descent algorithm for regression problems, as in Andrew Ng's Machine Learning course on Coursera, I found it helpful to have a learning rate alpha less than 0.05 and a number of iterations greater than 100000. Try tweaking your learning rate, then create a confusion matrix, which will help you understand the accuracy of your system.
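As a rough sketch of the confusion-matrix idea (the targets and predictions names are placeholders for integer label arrays, not the poster's variables):
import numpy as np

def confusion_matrix(targets, predictions, n_classes=10):
    # cm[i, j] counts examples whose true label is i and whose predicted label is j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(targets, predictions):
        cm[t, p] += 1
    return cm
The diagonal holds the correct predictions, and rows concentrated in a single off-diagonal column show which digits are being confused with which.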
In my experience there are a lot of things that can go wrong with an ANN. I'll list some possible errors for you to consider.
Assuming the classification accuracy does not increase at all after training:
Something is wrong with the training or testing sets.
Too high a learning rate can sometimes cause the algorithm to not converge at all. Try setting it very small, like 0.01 or 0.001. If there is still no convergence, the issue probably has to do with something other than the gradient descent.
Assuming the training accuracy does increase but is worse than expected:
The normalisation process is not correctly implemented. For images it is recommended to use zero-mean-unit-variance.
The learning rate is too low or too high.
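A minimal sketch of the zero-mean, unit-variance normalisation mentioned above, assuming X_train and X_test are NumPy arrays of flattened images with one row per example (hypothetical names; the statistics must come from the training set and be reused on the test set):
import numpy as np

def standardise(X, mean=None, std=None):
    # Compute per-pixel statistics on the training data, then reuse them elsewhere
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0) + 1e-8  # guard against constant pixels
    return (X - mean) / std, mean, std

X_train_norm, mu, sigma = standardise(X_train)
X_test_norm, _, _ = standardise(X_test, mu, sigma)  # reuse training statistics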
Linear regression implementation always performs worse than sklearn
I implemented linear regression with gradient descent in python. To see how well it is doing I compared it with scikit-learn's LinearRegression() class. For some reason, sklearn always outperforms my program by a MSE of 3 on average (I am using the Boston Housing dataset for testing). I understand that I am currently not doing gradient checking to check for convergence, but I am allowing for many iterations and have set the learning rate low enough such that it SHOULD converge. Is there any clear bug in my learning algorithm implementation? Here is my code: import numpy as np from sklearn.linear_model import LinearRegression def getWeights(x): lenWeights = len(x[1,:]); weights = np.random.rand(lenWeights) bias = np.random.random(); return weights,bias def train(x,y,weights,bias,maxIter): converged = False; iterations = 1; m = len(x); alpha = 0.001; while not converged: for i in range(len(x)): # Dot product of weights and training sample hypothesis = np.dot(x[i,:], weights) + bias; # Calculate gradient error = hypothesis - y[i]; grad = (alpha * 1/m) * ( error * x[i,:] ); # Update weights and bias weights = weights - grad; bias = bias - alpha * error; iterations = iterations + 1; if iterations > maxIter: converged = True; break return weights, bias def predict(x, weights, bias): return np.dot(x,weights) + bias if __name__ == '__main__': data = np.loadtxt('housing.txt'); x = data[:,:-1]; y = data[:,-1]; for i in range(len(x[1,:])): x[:,i] = ( (x[:,i] - np.min(x[:,i])) / (np.max(x[:,i]) - np.min(x[:,i])) ); initialWeights,initialBias = getWeights(x); weights,bias = train(x,y,initialWeights,initialBias,55000); pred = predict(x, weights,bias); MSE = np.mean(abs(pred - y)); print "This Program MSE: " + str(MSE) sklearnModel = LinearRegression(); sklearnModel = sklearnModel.fit(x,y); sklearnModel = sklearnModel.predict(x); skMSE = np.mean(abs(sklearnModel - y)); print "Sklearn MSE: " + str(skMSE)
First, make sure that you are computing the correct objective function value. The linear regression objective should be .5*np.mean((pred-y)**2), rather than np.mean(abs(pred - y)). You are actually running a stochastic gradient descent (SGD) algorithm (running a gradient iteration on individual examples), which should be distinguished from "gradient descent". SGD is a good learning method, but a bad optimization method - it can take many iterations to converge to a minimum of the empirical error (http://leon.bottou.org/publications/pdf/nips-2007.pdf). For SGD to converge, the learning rate must be restricted. Typically, the learning rate is set to the base learning rate divided by the number of iterations, something like alpha/(iterations+1), using the variables in your code. You also include a multiple of 1/m in your gradient, which is typically not used in SGD updates. To test your SGD implementation, rather than evaluating the error on the dataset that you trained with, split the dataset into a training set and a test set, and evaluate the error on this test set after training with both methods. The training/test set split will allow you to estimate the performance of your algorithm as a learning algorithm (estimate the expected error) rather than as an optimization algorithm (minimize the empirical error).
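A minimal sketch of the decaying-step-size suggestion applied to an SGD loop like the one in the question (the x, y, weights and bias names mirror the question's variables; base_alpha is an assumed hyperparameter):
import numpy as np

def sgd_train(x, y, weights, bias, max_iter=55000, base_alpha=0.01):
    iterations = 0
    while iterations < max_iter:
        for i in range(len(x)):
            alpha = base_alpha / (iterations + 1)          # decaying step size helps SGD converge
            error = np.dot(x[i, :], weights) + bias - y[i]
            weights = weights - alpha * error * x[i, :]    # note: no 1/m factor in the SGD update
            bias = bias - alpha * error
            iterations += 1
            if iterations >= max_iter:
                break
    return weights, bias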
Try increasing your iteration value. This should allow your algorithm to, hopefully, converge on a value that is closer to the global minimum. Keep in mind you are not using L-BFGS, which can converge much faster than plain gradient descent or even SGD. Also try using the normal equation as another way to do linear regression. http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/.
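For reference, a rough sketch of the normal-equation approach, assuming x is the (scaled) feature matrix and y the target vector from the question (a column of ones is appended to absorb the bias term):
import numpy as np

def normal_equation_fit(x, y):
    # Closed-form least squares: solve (X^T X) theta = X^T y without an explicit matrix inverse
    X = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return theta[:-1], theta[-1]                  # weights, bias
For ill-conditioned or rank-deficient feature matrices, np.linalg.lstsq is a safer alternative.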