Improving a simple 1 layer Neural Network - python

I've created my own very simple 1-layer neural network for binary classification problems. The input data points are multiplied by the weights and a bias is added; the whole thing is summed (the weighted sum) and fed through an activation function (such as relu or sigmoid). That is the prediction output. There are no other layers (i.e. hidden layers) involved.
Just for my own understanding of the mathematical side, I didn't want to use an existing library/package (e.g. Keras, PyTorch, Scikit-learn, etc.), but simply wanted to create a neural network using plain Python code. The model is created inside a method (simple_1_layer_classification_NN) that takes the necessary parameters to make a prediction. However, I encountered some problems, so I've listed my questions below along with my code.
P.S. I really apologise for including such a large portion of code, but I didn't know how else to ask the questions without referencing the relevant code.
The questions:
1 - When I passed a training dataset to the network, I found that the final average accuracy differed completely for different numbers of epochs, with no clear pattern pointing to an optimal number of epochs. I kept the other parameters the same: learning rate = 0.5, activation = sigmoid (since there is only 1 layer, acting as both the input and output layer, with no hidden layers involved; I've read that sigmoid suits an output layer better than relu), cost function = squared error. Here are the results for different numbers of epochs:
Epoch = 100,000.
Average Accuracy: 50.10541638874056
Epoch = 500,000.
Average Accuracy: 50.08965597645948
Epoch = 1,000,000.
Average Accuracy: 97.56879179064482
Epoch = 7,500,000.
Average Accuracy: 49.994692515332524
Epoch = 750,000.
Average Accuracy: 77.0028368954157
Epoch = 100.
Average Accuracy: 48.96967591507596
Epoch = 500.
Average Accuracy: 48.20721972881673
Epoch = 10,000.
Average Accuracy: 71.58066454336122
Epoch = 50,000.
Average Accuracy: 62.52998222597177
Epoch = 100,000.
Average Accuracy: 49.813675726563424
Epoch = 1,000,000.
Average Accuracy: 49.993141329926374
As you can see, there doesn't seem to be any clear pattern. I tried 1 million epochs and got 97.6% accuracy. Then I tried 7.5 million epochs and got 50% accuracy. Half a million epochs also got 50% accuracy. 100 epochs resulted in 49% accuracy. Then the really odd one: I tried 1 million epochs again and got 50%.
So I'm sharing my code below, because I don't believe the network is doing any learning; it just seems to be making random guesses. I applied back-propagation and partial derivatives to optimise the weights and bias, so I'm not sure where I'm going wrong with my code.
2 - One of the parameters I included in the parameter list of the simple_1_layer_classification_NN method is input_dimension. At first I thought it would be needed to work out the number of weights required for the input layer. Then I realised that, as long as the dataset_input_matrix (matrix of features) argument is passed to the method, I can index the matrix at random to get an observation vector (input_observation_vector = dataset_input_matrix[ri]) and then loop through the observation to access each feature. The length of the observation vector tells me exactly how many weights are required, because each feature requires one weight (as its coefficient). So len(input_observation_vector) gives the number of weights required in the input layer, and I don't need to ask the user to pass an input_dimension argument to the method.
So my question is simply: is there any need/reason to include an input_dimension parameter, when this can be worked out by evaluating the length of an observation vector from the input matrix?
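For illustration, here is a minimal sketch of reading the number of weights straight from the feature matrix. The helper name infer_input_dimension is hypothetical and not part of the code further down; it assumes the matrix is rectangular, with one row per observation.

```python
import numpy as np

def infer_input_dimension(dataset_input_matrix):
    """Return the number of features (columns) in a 2-D feature matrix."""
    matrix = np.asarray(dataset_input_matrix, dtype=float)
    if matrix.ndim != 2:
        raise ValueError("expected a 2-D matrix of shape (n_samples, n_features)")
    return matrix.shape[1]

# Three observations with two features each (length and width).
X = [[3, 1.5], [2, 1], [4, 1.5]]
n_weights = infer_input_dimension(X)
print(n_weights)
```

Reading the shape once, before training starts, also catches a non-rectangular matrix early instead of failing somewhere inside the training loop.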
3 - When I try to plot the array of cost values with plt.plot(y_costs), nothing shows up. A cost value (produced in every epoch) is appended to the costs array only every 50 epochs, to avoid adding too many elements to the array when the number of epochs is really high. This happens at the line:
if i % 50 == 0:
When I did some debugging, I found that the costs array is empty after the method returns. I'm not sure why that is, when it should be appending a cost value every 50th epoch. I've probably overlooked something really silly that I can't see.
Many thanks in advance, and apologies again for the long piece of code.
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import sys
# import os
class NN_classification:
    def __init__(self):
        self.bias = float()
        self.weights = []
        self.chosen_activation_func = None
        self.chosen_cost_func = None
        self.train_average_accuracy = int()
        self.test_average_accuracy = int()
    # -- Activation functions --:
    def sigmoid(x):
        return 1/(1 + np.exp(-x))
    def relu(x):
        return np.maximum(0.0, x)
    # -- Derivative of activation functions --:
    def sigmoid_derivation(x):
        return NN_classification.sigmoid(x) * (1-NN_classification.sigmoid(x))
    def relu_derivation(x):
        if x <= 0:
            return 0
        return 1
    # -- Squared-error cost function --:
    def squared_error(pred, target):
        return np.square(pred - target)
    # -- Derivative of squared-error cost function --:
    def squared_error_derivation(pred, target):
        return 2 * (pred - target)
    # --- neural network structure diagram ---
    #     O     output prediction
    #    / \    w1, w2, b
    #   O   O   datapoint 1, datapoint 2
    def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'):
        weights = []
        bias = int()
        cost = float()
        costs = []
        dCost_dWeights = []
        chosen_activation_func_derivation = None
        chosen_cost_func = None
        chosen_cost_func_derivation = None
        correct_pred = int()
        incorrect_pred = int()
        # store the chosen activation function to use it later on in the activation calculation section and in the 'predict' method
        # The same goes for the derivation section.
        if activation_func == 'sigmoid':
            self.chosen_activation_func = NN_classification.sigmoid
            chosen_activation_func_derivation = NN_classification.sigmoid_derivation
        elif activation_func == 'relu':
            self.chosen_activation_func = NN_classification.relu
            chosen_activation_func_derivation = NN_classification.relu_derivation
        else:
            print("Exception error - no activation function utilised, in training method", file=sys.stderr)
        # store the chosen cost function to use it later on in the cost calculation section.
        # The same goes for the cost derivation section.
        if cost_func == 'squared_error':
            chosen_cost_func = NN_classification.squared_error
            chosen_cost_func_derivation = NN_classification.squared_error_derivation
        else:
            print("Exception error - no cost function utilised, in training method", file=sys.stderr)
        # Set initial network parameters (weights & bias):
        # Will initialise the weights to a uniform distribution and ensure the numbers are small, close to 0.
        # We need to loop through all the weights to set them to a random value initially.
        for i in range(input_dimension):
            # create random numbers for our initial weights (connections) to begin with. 'rand' method creates small random numbers.
            w = np.random.rand()
            weights.append(w)
        # create a random number for our initial bias to begin with.
        bias = np.random.rand()
        # We perform the training based on the number of epochs specified
        for i in range(epochs):
            # create random index
            ri = np.random.randint(len(dataset_input_matrix))
            # Pick random observation vector: pick a random observation vector of independent variables (x) from the dataset matrix
            input_observation_vector = dataset_input_matrix[ri]
            # reset weighted sum value at the beginning of every epoch to avoid incrementing the previous observations' weighted sums on top.
            weighted_sum = 0
            # Loop through all the independent variables (x) in the observation
            for i in range(len(input_observation_vector)):
                # Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum
                weighted_sum += input_observation_vector[i] * weights[i]
            # Add Bias: add bias to weighted sum
            weighted_sum += bias
            # Activation: process weighted_sum through activation function
            activation_func_output = self.chosen_activation_func(weighted_sum)
            # Prediction: because this is a single layer neural network, the activation output will be the same as the prediction
            pred = activation_func_output
            # Cost: the cost function to calculate the prediction error margin
            cost = chosen_cost_func(pred, output_data_labels[ri])
            # Also calculate the derivative of the cost function with respect to prediction
            dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])
            # Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum.
            dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)
            # Bias is just a number on its own added to the weighted sum, so its derivative is just 1
            dWeightSum_dB = 1
            # The derivative of the Weighted Sum with respect to each weight is the input data point / independent variable it's multiplied by.
            # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
            # to represent the array of the derivative of all the weights involved. I could've used the 'input_sample'
            # array variable itself, but for the sake of readability, I created a separate variable to represent the derivative of each of the weights.
            dWeightedSum_dWeights = input_observation_vector
            # Derivative chaining rule: chaining all the derivative functions together (chaining rule)
            # Loop through all the weights to work out the derivative of the cost with respect to each weight:
            for dWeightedSum_dWeight in dWeightedSum_dWeights:
                dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
                dCost_dWeights.append(dCost_dWeight)
            dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB
            # Backpropagation: update the weights and bias according to the derivatives calculated above.
            # In other words we update the parameters of the neural network to correct parameters and therefore
            # optimise the neural network prediction to be as accurate to the real output as possible
            # We loop through each weight and update it with its derivative with respect to the cost error function value.
            for i in range(len(weights)):
                weights[i] = weights[i] - learning_rate * dCost_dWeights[i]
            bias = bias - learning_rate * dCost_dB
            # for each 50th loop we're going to get a summary of the
            # prediction compared to the actual output
            # to see if the prediction is as expected.
            # Anything in prediction above 0.5 should match value
            # 1 of the actual output. Any prediction below 0.5 should
            # match value of 0 for actual output
            if i % 50 == 0:
                costs.append(cost)
            # Compare prediction to target
            error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
            accuracy = (1 - error_margin) * 100
            self.train_average_accuracy += accuracy
            # Evaluate whether guessed correctly or not based on classification binary problem 0 or 1 outcome.
            # So if prediction is above 0.5 it guessed 1 and below 0.5 it guessed incorrectly. If it's dead on 0.5 it is incorrect for either guess,
            # because it's not exactly a good guess for either 0 or 1. We need to set a good standard for the neural net model.
            if (error_margin < 0.5) and (error_margin >= 0):
                correct_pred += 1
            elif (error_margin >= 0.5) and (error_margin <= 1):
                incorrect_pred += 1
            else:
                print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr)
        # store the final optimised weights to the weights instance variable so it can be used in the predict method.
        self.weights = weights
        # store the final optimised bias to the bias instance variable so it can be used in the predict method.
        self.bias = bias
        # Calculate average accuracy from the predictions of all observations in the training dataset
        self.train_average_accuracy /= epochs
        # Print out results
        print('Average Accuracy: {}'.format(self.train_average_accuracy))
        print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred))
        print('costs = {}'.format(costs))
        y_costs = np.array(costs)
        plt.plot(y_costs)
from numpy import array
#define array of dataset
# each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0, 1 to represent blue flower and red flower respectively).
data = array([[3, 1.5, 1],
[2, 1, 0],
[4, 1.5, 1],
[3, 1, 0],
[3.5, 0.5, 1],
[2, 0.5, 0],
[5.5, 1, 1],
[1, 1, 0]])
# separate data: split input, output, train and test data.
X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1]
nn_model = NN_classification()
nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 1000000, learning_rate=0.5)

Have you tried a smaller learning rate? Your network may be skipping over local minima because it is too high.
Here's an article that goes more in-depth on learning rates:
The reason that the cost is never getting appended is because you are using the same variable, 'i', within nested for loops.
# We perform the training based on the number of epochs specified
for i in range(epochs):
    # create random index
    ri = np.random.randint(len(dataset_input_matrix))
    # Pick random observation vector: pick a random observation vector of independent variables (x) from the dataset matrix
    input_observation_vector = dataset_input_matrix[ri]
    # reset weighted sum value at the beginning of every epoch to avoid incrementing the previous observations' weighted sums on top.
    weighted_sum = 0
    # Loop through all the independent variables (x) in the observation
    for i in range(len(input_observation_vector)):
        # Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum
        weighted_sum += input_observation_vector[i] * weights[i]
    # Add Bias: add bias to weighted sum
    weighted_sum += bias
    # Activation: process weighted_sum through activation function
    activation_func_output = self.chosen_activation_func(weighted_sum)
    # Prediction: because this is a single layer neural network, the activation output will be the same as the prediction
    pred = activation_func_output
    # Cost: the cost function to calculate the prediction error margin
    cost = chosen_cost_func(pred, output_data_labels[ri])
    # Also calculate the derivative of the cost function with respect to prediction
    dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])
    # Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum.
    dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)
    # Bias is just a number on its own added to the weighted sum, so its derivative is just 1
    dWeightSum_dB = 1
    # The derivative of the Weighted Sum with respect to each weight is the input data point / independent variable it's multiplied by.
    # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
    # to represent the array of the derivative of all the weights involved. I could've used the 'input_sample'
    # array variable itself, but for the sake of readability, I created a separate variable to represent the derivative of each of the weights.
    dWeightedSum_dWeights = input_observation_vector
    # Derivative chaining rule: chaining all the derivative functions together (chaining rule)
    # Loop through all the weights to work out the derivative of the cost with respect to each weight:
    for dWeightedSum_dWeight in dWeightedSum_dWeights:
        dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
        dCost_dWeights.append(dCost_dWeight)
    dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB
    # Backpropagation: update the weights and bias according to the derivatives calculated above.
    # In other words we update the parameters of the neural network to correct parameters and therefore
    # optimise the neural network prediction to be as accurate to the real output as possible
    # We loop through each weight and update it with its derivative with respect to the cost error function value.
    for i in range(len(weights)):
        weights[i] = weights[i] - learning_rate * dCost_dWeights[i]
    bias = bias - learning_rate * dCost_dB
    # for each 50th loop we're going to get a summary of the
    # prediction compared to the actual output
    # to see if the prediction is as expected.
    # Anything in prediction above 0.5 should match value
    # 1 of the actual output. Any prediction below 0.5 should
    # match value of 0 for actual output
This was causing 'i' to always be 1 (the last index of the two-element weights list) when it got to the if statement:
    if i % 50 == 0:
        costs.append(cost)
    # Compare prediction to target
    error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
    accuracy = (1 - error_margin) * 100
    self.train_average_accuracy += accuracy
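A small standalone sketch of the effect follows. The function and variable names are made up for illustration and are not taken from the question's code; it only counts how often the modulo check passes when the inner loop does or does not reuse the outer loop's name.

```python
def count_logged_epochs(epochs, n_weights, reuse_outer_name):
    """Count how often the `i % 50 == 0` check passes inside the epoch loop."""
    logged = 0
    for i in range(epochs):
        if reuse_outer_name:
            # The inner loop rebinds `i`, as in the question's code.
            for i in range(n_weights):
                pass
        else:
            # A distinct name leaves the epoch counter untouched.
            for j in range(n_weights):
                pass
        if i % 50 == 0:
            logged += 1
    return logged

print(count_logged_epochs(200, 2, reuse_outer_name=True))   # 0
print(count_logged_epochs(200, 2, reuse_outer_name=False))  # 4
```

Giving each inner loop its own name (for example j or w_idx) is enough to make the check fire on epochs 0, 50, 100 and so on.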
So I tried training the model 1000 times with random learning rates between 0 and 1, and the initial learning rate doesn't seem to make any difference. Only 0.3% of these runs achieved an accuracy above 60%, and none of them were above 70%.
Then I ran the same test with an adaptive learning rate:
# Modify the learning rate based on the cost
# Placed just before the bias is calculated
learning_rate = 0.999 * learning_rate + 0.1 * cost
This is resulting in about 10-12% of the models having an accuracy above 60%, and about 2.5% of them are above 70%
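For comparison, here is a compact vectorised sketch of the same single-neuron model (sigmoid activation, squared-error cost). It is not the asker's code: it swaps the per-sample random updates for full-batch gradient descent, starts from zero weights, and uses a fixed learning rate of 0.1, all of which are assumptions chosen to keep the run deterministic. The gradients are recomputed from the current predictions on every epoch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_neuron(X, y, epochs=2000, learning_rate=0.1):
    """Full-batch gradient descent for one sigmoid neuron with squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.zeros(X.shape[1])  # one weight per feature
    bias = 0.0
    costs = []
    for epoch in range(epochs):
        pred = sigmoid(X @ weights + bias)
        costs.append(np.mean((pred - y) ** 2))
        # Chain rule: dCost/dz = dCost/dPred * dPred/dz, recomputed every epoch.
        dcost_dz = 2.0 * (pred - y) * pred * (1.0 - pred)
        weights -= learning_rate * (X.T @ dcost_dz) / len(y)
        bias -= learning_rate * np.mean(dcost_dz)
    return weights, bias, costs

# The six training rows from the question (length, width) and their labels.
X_train = np.array([[3, 1.5], [2, 1], [4, 1.5], [3, 1], [3.5, 0.5], [2, 0.5]])
y_train = np.array([1, 0, 1, 0, 1, 0])

weights, bias, costs = train_single_neuron(X_train, y_train)
print(costs[0], costs[-1])
```

Plotting costs from a run like this gives a quick check on whether the cost is actually going down, which is easier to interpret than an accuracy averaged over the whole of training.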


How can I correctly implement backpropagation using categorical cross-entropy as my loss function in just Numpy and Pandas?

Long story short: how do I fix my backpropagation code so that the weights and biases are changed effectively by my evaluate() function, producing predictions closer to the target values rather than odds no better than guessing?
Details below:
I've currently got the backbone of this neural network from scratch which I'm creating using techniques gleaned from Sentdex's Neural Networks From Scratch series on YouTube and a Towards Data Science article for the backpropagation part specifically. It works by creating a large class called Neural Network which would have several LayerDense objects associated by composition, which would act as each layer within the neural network.
As inputs to the neural network, I pass in a batch of 8 records from a Pandas DataFrame, each containing 100 values of 0 or 1, depending on the participant's preferred options. As target values, I pass in another DataFrame containing the actual gender of each participant, with 0 being male and 1 being female.
These LayerDense objects would deal with the forward and backward passes of each layer. Prior to implementing the softmax function and backpropagation, this all worked as expected.
My current issue is getting the evaluate() function within the program to run as expected & getting the run() function to handle this information correctly.
In theory, the evaluate function should return the loss of each neuron, and the run function should handle this and run the backward pass through each neuron, adjusting its weights and biases appropriately.
What actually happens is that my final classification outputs (the confidence levels of the predictions, with values closer to 0 representing a male prediction and values closer to 1 a female prediction) are no better than guessing.
Using categorical cross-entropy as my loss function, how would I properly implement backpropagation in this situation? What may I be doing wrong here?
All resource links used to get this far and the whole source code will be linked below.
Current evaluation code
def evaluate(self):
    #Target values are the y values that we want to be predicting correctly
    #You can calculate the loss of a categorical neural network (basically most NN) by using
    #categorical cross-entropy
    #Using one-hot encoding to calculate the categorical cross-entropy of data (loss)
    #In one-hot encoding, we assign the target class position we want in our array of outputs
    #Then make an array of 0s of the same length as outputs but put a 1 in the target class position
    #This basically simplifies to just the negative natural logarithm of the predicted target value
    #The following code will represent the confidence values in the predictions made by the NN
    #For this to work, if categorical, the number of outputs must equal the number of possible class targets
    #E.g for gender, there's two possible class targets (0 and 1), so two output neurons
    #The string can be changed to the attribute in the table that you shall be predicting
    #A short but ugly way of getting a start to complete this task
    loss = -np.log(self._network[-1].output[range(len(self._network[-1].output)),target_values.loc[:,"gender"]])
    average_loss = np.mean(loss)
    #A nicer way to accomplish the same thing
    samples = len(self._network[-1].output)
    #Clip the values so we don't get any infinity errors if a confidence level happens to be spot on
    y_pred_clipped = np.clip(self._network[-1].output, 1e-7, 1-1e-7)
    #If one-hot encoding has not been passed in
    if len(self._target_values.shape) == 1:
        #Selecting the largest confidences based on their position
        correct_confidences = y_pred_clipped[range(samples),self._target_values[:samples]]
    elif len(self._target_values.shape) == 2:
        #One-hot encoding has been used in this scenario
        correct_confidences = np.sum(y_pred_clipped*self._target_values[:samples], axis=1)
    #Calculate the loss and return
    loss = -np.log(correct_confidences)
    return loss, correct_confidences
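As a standalone illustration of the two branches above, the sketch below computes the per-sample categorical cross-entropy from either integer class labels or one-hot rows and shows that both give the same values. The function name and the sample probabilities are made up for this example, and plain NumPy arrays are assumed rather than DataFrames.

```python
import numpy as np

def categorical_cross_entropy(y_pred, y_true):
    """Per-sample loss; y_true is either class indices (1-D) or one-hot rows (2-D)."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    y_true = np.asarray(y_true)
    samples = len(y_pred)
    if y_true.ndim == 1:
        # Pick the predicted confidence of the target class for each sample.
        correct_confidences = y_pred[np.arange(samples), y_true]
    elif y_true.ndim == 2:
        # One-hot rows zero out every column except the target class.
        correct_confidences = np.sum(y_pred * y_true, axis=1)
    else:
        raise ValueError("y_true must be 1-D or 2-D")
    return -np.log(correct_confidences)

probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
labels = np.array([0, 1, 1])
one_hot = np.eye(2)[labels]

print(categorical_cross_entropy(probs, labels))
print(categorical_cross_entropy(probs, one_hot))
```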
Current run() code
def run(self, **kwargs):
    epochs = kwargs['epochs']
    #Start by putting initial inputs into the input layer and generating the network
    for i in range(len(self._network)-1):
        #Using the previous layer's outputs as the next layer's inputs
    for i in range(epochs):
        #Forward pass
        for i in range(len(self._network)-1):
            output = self._network[i+1].forward_pass(self._network[i].output)
        #Generates the values for loss function, used for training in multiple passes
        #Backbone of backpropagation
        loss = neural.evaluate()
        #Backward pass
        #Somehow find a way to derive the evaluate function on predicted values and target values
        error, confidences = [np.e**-x for x in loss]
        confidences = [np.e**-x for x in confidences]
        error = confidences
        for i in range(len(self._network)-1,-1):
            error = self._network[i-1].backward(error, self._learning_rate)
        print('Epoch %d/%d' % (i+1, epochs))
    #Start by putting initial inputs into the input layer
    for i in range(len(self._network)-1):
        #Using the previous layer's outputs as the next layer's inputs
    print("The network's testing outputs were:", self._network[-1].output)
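One detail worth checking in the backward-pass loop above: range(len(self._network)-1,-1) has a start above its stop and no negative step, so in Python it yields no indices at all and the loop body never runs. A minimal illustration with a made-up list of layer names:

```python
layers = ["input", "hidden", "output"]

# With a start above the stop and the default step of +1, the range is empty.
print(list(range(len(layers) - 1, -1)))      # []

# A step of -1 walks the indices from the last layer down to the first.
print(list(range(len(layers) - 1, -1, -1)))  # [2, 1, 0]

# reversed() expresses the same traversal without index arithmetic.
for layer in reversed(layers):
    print(layer)
```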
Backward pass code which runs for each layer
def backward(self, output_error, learning_rate):
    #The error of this layer's inputs is equal to its output error multiplied by the
    #transposed weights of the layer
    #(the first argument was lost from this listing; np.dot(output_error, ...) is assumed from the comment above)
    input_error = np.dot(output_error, self.weights.T)
    #The error of the weights in this layer is equal to the transposed matrix of inputs fed into the layer
    #multiplied by the error of the output from this layer
    #(self.inputs is an assumed attribute name; the original expression was lost from this listing)
    weights_error = np.dot(self.inputs.T, output_error)
    # dBias = output_error
    # update parameters
    self.weights -= learning_rate * weights_error
    self.biases -= learning_rate * output_error
    return input_error
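For a softmax output layer trained with categorical cross-entropy, the gradient of the mean loss with respect to the pre-softmax values works out to (softmax output minus one-hot target) divided by the batch size. The sketch below checks that formula against finite differences. It is a standalone illustration with made-up function names, and it assumes the value passed into the last layer's backward() is meant to be the gradient with respect to that layer's pre-activation output.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def mean_cross_entropy(logits, labels):
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def softmax_cross_entropy_grad(logits, labels):
    """Gradient of the mean cross-entropy with respect to the logits."""
    grad = softmax(logits)
    grad[np.arange(len(labels)), labels] -= 1.0  # softmax output minus one-hot target
    return grad / len(labels)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2))      # batch of 8 samples, 2 classes
labels = rng.integers(0, 2, size=8)   # integer class targets

analytic = softmax_cross_entropy_grad(logits, labels)

# Central finite differences as an independent check of the formula.
numeric = np.zeros_like(logits)
eps = 1e-6
for idx in np.ndindex(*logits.shape):
    bumped_up, bumped_down = logits.copy(), logits.copy()
    bumped_up[idx] += eps
    bumped_down[idx] -= eps
    numeric[idx] = (mean_cross_entropy(bumped_up, labels)
                    - mean_cross_entropy(bumped_down, labels)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```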
Aforementioned softmax function within forward() function of LayerDense
elif self._activation_function.lower() == 'softmax':
    #Exponentiate (e to the power of x) values and subtract largest value of layer to prevent overflow
    #Afterwards, normalise (put as relative fractions) the output values
    #In theory, to get the max value out of each batch, axis should be set to 1 and keepdims should be True
    neuron_output = np.exp(neuron_output - np.max(layer_output,axis=0)) / np.sum(np.exp(layer_output),axis=0)
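A row-wise version along the lines the last comment hints at could look like the sketch below. It assumes the layer output has shape (batch, neurons), subtracts each row's maximum, and uses the same shifted values in both numerator and denominator. The function name is illustrative.

```python
import numpy as np

def softmax_rows(layer_output):
    """Softmax over each sample (row) of a (batch, neurons) array."""
    layer_output = np.asarray(layer_output, dtype=float)
    # Subtract each row's maximum so the largest exponent is exp(0) = 1.
    shifted = layer_output - np.max(layer_output, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    # The same shifted values are used in numerator and denominator.
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

batch = np.array([[1.0, 2.0], [1000.0, 1001.0]])
probabilities = softmax_rows(batch)
print(probabilities.sum(axis=1))  # each row sums to 1
```

Because softmax is unchanged by adding a constant to a row, the two rows in this example produce the same probabilities even though the second would overflow without the shift.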
Mentioned SentDex tutorial:
Mentioned TDS article:
Source code:

Average error and standard deviation of error within epoch not correctly updating - PyTorch

I am attempting to use Stochastic Gradient Descent but I am unsure as to why my error/loss is not decreasing. The information I am using from the train dataframe is the index (each sequence) and the binding affinity, and the goal is to predict the binding affinity. Here is what the head of the dataframe looks like:
For the training, I make a one-hot of a sequence and calculate a score with another matrix, and the goal is to get this score to be as close to the binding affinity as possible (for any given peptide). How I calculate the score and my training loop is shown in my code below but I don't think an explanation is necessary to solve why my error fails to decrease.
def p_one_hot(seq):
c2i = dict((c,i) for i,c in enumerate(aa))
int_encoded = [c2i[char] for char in seq]
onehot_encoded = list()
for value in int_encoded:
letter = [0 for _ in range(len(aa))]
letter[value] = 1
a=Var(torch.randn(20,1),requires_grad=True) #initalize similarity matrix - random array of 20 numbers
freq_m=Var(torch.randn(12,20),requires_grad=True) to 1 scaling
optimizer = optim.SGD([torch.nn.Parameter(a), torch.nn.Parameter(freq_m)], lr=1e-6)
loss = nn.MSELoss()
epochs = 100
for i in range(epochs):
train = all_seq.sample(frac=.03)
names = train.index.values.tolist()
affinities = train['binding_affinity']
print('Epoch: ' + str(i))
#forward pass
for j, seq in enumerate(names):,a.t()) #make simalirity matrix square symmetric,keepdim=True) #sum of each row must be 1 (sum of probabilities of each amino acid at each position)
affin_score = affinities[j]
new_m =, freq_m)
tss_m = new_m * sm
tss_score = tss_m.sum()
sms = sm
fms = freq_m
error = loss(tss_score, torch.FloatTensor(torch.Tensor([affin_score])))
mean = statistics.mean(iteration_loss)
stdev = statistics.stdev(iteration_loss)
print('Epoch Average Error: ' + str(mean) + '. Epoch Standard Deviation: ' + str(stdev))
After each epoch, I print out the average of all errors for that epoch as well as the standard deviation. Each epoch runs through about 45,000 sequences. However, after 10 epochs I'm still not seeing any improvement with my error and I'm unsure as to why. Here is the output I am seeing:
Are there any ideas as to what I'm doing wrong? I'm new to PyTorch so any help is appreciated! Thank you!
It turns out that casting the optimizer parameters into torch.nn.Parameter() makes the tensors fail to hold on to updates, and removing this now shows a decreasing error.

Custom combined hinge/KL-divergence loss function in siamese-net fails to generate meaningful speaker-embeddings

I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is that I never get over 0.5 accuracy, and when clustering speaker-embeddings the results show there seems to be no correlation between embeddings and speakers.
I implemented the KL divergence as the distance measure, and adjusted the hinge-loss accordingly:
def kullback_leibler_divergence(vects):
    x, y = vects
    x = ks.backend.clip(x, ks.backend.epsilon(), 1)
    y = ks.backend.clip(y, ks.backend.epsilon(), 1)
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
def kullback_leibler_shape(shapes):
    shape1, shape2 = shapes
    return shape1[0], 1
def kb_hinge_loss(y_true, y_pred):
    """
    y_true: binary label, 1 = same speaker
    y_pred: output of siamese net i.e. kullback-leibler distribution
    """
    hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
    return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
    if gpu:
        return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
    return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
def build_model(mode: str = 'train') -> ks.Model:
    topology = TRAIN_CONF['topology']
    is_gpu = tf.test.is_gpu_available(cuda_only=True)
    model = ks.Sequential(name='base_network')
    model.add(
        ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
    model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
    if mode == 'extraction':
        return model
    num_units = topology['dense1_units']
    model.add(ks.layers.Dense(num_units, name='dense_1'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense2_units']
    model.add(ks.layers.Dense(num_units, name='dense_2'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense3_units']
    model.add(ks.layers.Dense(num_units, name='dense_3'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    num_units = topology['dense4_units']
    model.add(ks.layers.Dense(num_units, name='dense_4'))
    model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
    return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture with only one input, and try to extract embeddings, and then build the mean over them, where an embedding should serve as a representation for a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared directly. For example:
For correct results, when y_true == 1 (same speaker), Kullback-Leibler is y_pred == 0 (no divergence).
So it's totally expected that metrics will not work properly.
Then, either you create a custom metric, or you count only on the loss for evaluations.
This custom metric should need a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
This might be a problem
First, notice that you're clipping the values passed to the Kullback-Leibler function. This may be bad because clipping loses the gradients in the clipped regions. And since your activation is a PReLU, you have values lower than zero and bigger than 1. So there are certainly zero-gradient cases here and there, with the risk of having a frozen model.
So, you might not want to clip these values. To avoid negative values from the PReLU, you can try a 'softplus' activation, which is a kind of soft relu without negative values. You might also add an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
#considering you used 'softplus' instead of 'PReLU' in speakers
def kullback_leibler_divergence(speakers):
    x, y = speakers
    x = x + ks.backend.epsilon()
    y = y + ks.backend.epsilon()
    return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Asymmetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symmetric function and, for inputs that are not normalised probability distributions (as with these layer outputs), it doesn't have its minimum at zero! The perfect match is zero, but bad matches can have lower values, and this is bad for a loss function because it will drive you to divergence.
See this picture showing KL's graph
Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the asymmetry and also the negative values.
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
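To make the asymmetry and the negative values concrete, here is a minimal pure-Python sketch of the same formula, outside of Keras. The vectors p and q are made-up, non-normalized values, and EPS is a hypothetical stand-in for ks.backend.epsilon(); this is only an illustration, not the model code.

```python
import math

EPS = 1e-7  # hypothetical small constant, standing in for ks.backend.epsilon()

def kl(x, y):
    """Unnormalized Kullback-Leibler term: sum(x * log(x / y))."""
    return sum((a + EPS) * math.log((a + EPS) / (b + EPS)) for a, b in zip(x, y))

def symmetric_kl(x, y):
    """Sum of both directions, mirroring distance1 + distance2."""
    return kl(x, y) + kl(y, x)

# Illustrative, non-normalized embeddings (made-up values).
p = [0.1, 0.2]
q = [0.5, 0.5]

print(kl(p, q))            # negative: a bad match scoring below zero
print(kl(q, p))            # a different value: KL is not symmetric
print(symmetric_kl(p, q))  # non-negative
print(symmetric_kl(p, p))  # 0.0 for a perfect match
```

Each term of the summed version is (x - y) * (log x - log y), which can never be negative, so the sum has its minimum at the perfect match.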
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergence may not be controlled by this loss. I'm not sure if this is really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
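The difference between the two hinge variants can be seen on plain numbers. The sketch below is pure Python, with an illustrative MARGIN of 1.0 and made-up distances; the function names are hypothetical helpers, not Keras API.

```python
import math

def hinge_max(margin, distance):
    """Standard hinge: zero (and zero gradient) once distance >= margin."""
    return max(0.0, margin - distance)

def hinge_softplus(margin, distance):
    """Softplus hinge: log(1 + exp(margin - distance)), always positive."""
    return math.log1p(math.exp(margin - distance))

def softplus_slope(margin, distance):
    """Magnitude of the softplus hinge's derivative: sigmoid(margin - distance)."""
    return 1.0 / (1.0 + math.exp(-(margin - distance)))

MARGIN = 1.0  # illustrative value only
for d in (0.5, 1.0, 3.0):
    print(d, hinge_max(MARGIN, d), hinge_softplus(MARGIN, d), softplus_slope(MARGIN, d))
```

For distances beyond the margin the max version is exactly zero, while the softplus version stays small but positive and keeps a non-zero slope.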
Now we can think of a custom accuracy
This is not very easy, since we don't have clear limits on KL that tell us "correct/not correct".
You might try one at random, but you'd need to tune this threshold parameter until you find a value that represents reality well. You may, for instance, use your validation data to find the threshold that brings the best accuracy.
# threshold must be defined beforehand (tuned as described above)
def customMetric(y_true_targets, y_pred_KBL):
    isMatch = ks.backend.less(y_pred_KBL, threshold)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    isMatch = ks.backend.equal(y_true_targets, isMatch)
    isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
    return ks.backend.mean(isMatch)
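One possible way to pick the threshold from validation data is a simple search over candidate values. This is a sketch in plain Python under stated assumptions: the divergences and labels are made-up numbers, and accuracy_at and best_threshold are hypothetical helper names.

```python
def accuracy_at(threshold, divergences, labels):
    """Fraction of pairs where (divergence < threshold) agrees with the label."""
    hits = sum(int(d < threshold) == y for d, y in zip(divergences, labels))
    return hits / len(labels)

def best_threshold(divergences, labels):
    """Try each observed divergence as a candidate threshold, keep the most accurate."""
    candidates = sorted(set(divergences)) + [max(divergences) + 1.0]
    return max(candidates, key=lambda t: accuracy_at(t, divergences, labels))

# Made-up validation pairs: label 1 = same speaker, 0 = different speaker.
val_divergences = [0.1, 0.3, 0.4, 1.2, 2.5, 3.0]
val_labels = [1, 1, 1, 0, 0, 0]

threshold = best_threshold(val_divergences, val_labels)
print(threshold, accuracy_at(threshold, val_divergences, val_labels))
```

The chosen value would then be used as the fixed threshold inside the metric; it should be re-tuned whenever the loss or the margin changes.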

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10^-1 to 10^-4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain in detail how to construct the loss function. I am unable to find an equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)
Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
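However, just for reference, use the following notation. Note that exponential_decay also takes a global_step, which the optimizer increments when it is passed to minimize.
global_step = tf.train.get_or_create_global_step()
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, global_step=global_step, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss, global_step=global_step)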
Q3: How to create the loss function described above?
First, you have not included the "pred" to "image" conversion in your post (based on the paper, you need to apply subtraction and IDWT to obtain the final image).
There is one problem here: the logits have to be calculated based on your label data, i.e. if you use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output to be used with the "Y : Label" data.
If your dataset's labeled images are the denoised ones, you need to follow this one instead:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is the L1 distance and it may suffer from some issues (check). Consider the following situation.
Let's say you have the following array as output, [10, 10, 10, 0, 0], and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfitting.
For the same case, consider the output [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case that we want.
If you use the L2 loss, for the first case you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, the optimizer will try to run away from #1 as quickly as possible to achieve global success rather than perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=image))
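The L1 versus L2 arithmetic from the two example outputs above can be checked with a few lines of plain Python. This is only a sketch of the sums of absolute and squared differences; the helper names are hypothetical, and note that tf.nn.l2_loss itself returns half of the sum of squares, so its values would be 100 and 40 with the same ordering.

```python
def l1_loss(output, target):
    """Sum of absolute differences."""
    return sum(abs(o - t) for o, t in zip(output, target))

def l2_loss(output, target):
    """Sum of squared differences (without the 1/2 factor)."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

target = [10, 10, 10, 10, 10]
case_1 = [10, 10, 10, 0, 0]   # three perfect outputs, two complete failures
case_2 = [6, 6, 6, 6, 6]      # every output moderately off

print(l1_loss(case_1, target), l1_loss(case_2, target))  # 20 20
print(l2_loss(case_1, target), l2_loss(case_2, target))  # 200 80
```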
Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parenthesis). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
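As a rough illustration of what "reduced in log scale at each epoch" could mean, the per-epoch values can be computed in plain Python and then used as the values of a piecewise-constant schedule or fed through a placeholder. The epoch count below is an assumption (it is not given in the quoted text), and log_scale_schedule is a hypothetical helper name.

```python
import math

def log_scale_schedule(lr_start, lr_end, num_epochs):
    """Learning rates evenly spaced in log10 between lr_start and lr_end."""
    if num_epochs == 1:
        return [lr_start]
    step = (math.log10(lr_end) - math.log10(lr_start)) / (num_epochs - 1)
    return [10 ** (math.log10(lr_start) + step * e) for e in range(num_epochs)]

NUM_EPOCHS = 4  # assumption: the epoch count is not given in the quoted text
rates = log_scale_schedule(1e-1, 1e-4, NUM_EPOCHS)
print(rates)  # approximately [0.1, 0.01, 0.001, 0.0001]

# Equivalent constant per-epoch factor for an exponential (staircase) decay.
decay_rate = (1e-4 / 1e-1) ** (1.0 / (NUM_EPOCHS - 1))
print(decay_rate)  # approximately 0.1
```

Because the steps are evenly spaced in log scale, consecutive rates differ by a constant factor, which is why a staircase exponential decay with one decay step per epoch gives the same schedule.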
Q3: How to create the loss function described above?
Your loss function looks correct.
Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D, 1)
y = np.dot(x, w)
print(y.shape)
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size, 1])
W = tf.get_variable("w", shape=[D, 1], initializer=tni)
B = tf.zeros([1])
# the learning rate is a placeholder so a new value can be fed at every step
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X, W), B)
print(pred.shape)
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
acc_err = 0.0
sess = tf.Session()
# variables (W and the momentum slots) must be initialized before training
sess.run(tf.global_variables_initializer())
for i in range(100000):
    if i > 0 and i % N == 0:
        # epoch done, decrease learning rate by 2
        learning_rate /= 2
        print("Epoch completed. LR = {}".format(learning_rate))
    idx = i // batch_size + i % batch_size
    f = {X: x[idx:idx+batch_size, :], Y: y[idx:idx+batch_size, :], lr: learning_rate}
    _, err = sess.run([train_op, mse], feed_dict=f)
    acc_err += err
    if i % 5000 == 0:
        print("Average error = {}".format(acc_err / 5000))
        acc_err = 0.0

Debugging a Neural Network

I have been trying to fit a simple neural network on MNIST, and it works for a small debugging setup, but when I bring it over to a subset of MNIST, it trains super fast and the gradient is close to 0 very quickly, but then it outputs the same value for any given input and the final cost is quite high. I had been trying to purposefully overfit to make sure it is in fact working but it will not do so on MNIST suggesting a deep problem in the setup. I have checked my backpropagation implementation using gradient checking and it seems to match up, so not sure where the error lies, or what to work on now!
Many thanks for any help you can offer, I've been struggling to fix this!
I have been trying to make a neural network in Numpy, based on this explanation:
Backpropagation seems to match gradient checking:
Backpropagation: [ 0.01168585, 0.06629858, -0.00112408, -0.00642625, -0.01339408,
-0.07580145, 0.00285868, 0.01628148, 0.00365659, 0.0208475 ,
0.11194151, 0.16696139, 0.10999967, 0.13873069, 0.13049299,
-0.09012582, -0.1344335 , -0.08857648, -0.11168955, -0.10506167]
Gradient Checking: [-0.01168585 -0.06629858 0.00112408 0.00642625 0.01339408
0.07580145 -0.00285868 -0.01628148 -0.00365659 -0.0208475
-0.11194151 -0.16696139 -0.10999967 -0.13873069 -0.13049299
0.09012582 0.1344335 0.08857648 0.11168955 0.10506167]
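For reference, a central-difference gradient check of the kind compared above can be sketched in plain Python as follows. The toy cost function and parameter values are made up so that the analytic gradient is known; numerical_gradient is a hypothetical helper, not the gradCheck method from the code dump.

```python
def numerical_gradient(cost, theta, eps=1e-5):
    """Central-difference estimate of d(cost)/d(theta[i]) for every i."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        # J(theta + eps) minus J(theta - eps); swapping the two flips the sign
        grad.append((cost(plus) - cost(minus)) / (2 * eps))
    return grad

# Toy cost with a known gradient: J(theta) = sum(theta_i^2), dJ/dtheta_i = 2 * theta_i.
def toy_cost(theta):
    return sum(t * t for t in theta)

theta = [0.5, -1.0, 2.0]
analytic = [2 * t for t in theta]
numeric = numerical_gradient(toy_cost, theta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

A check like this is usually judged on both the magnitude and the sign of each component, since a sign difference means the two vectors point in opposite directions.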
And when I train on this simple debug setup:
a is a neural net w/ 2 inputs -> 5 hidden -> 2 outputs, and learning rate 0.5
ie. x1 = [0.1, 0.9] and y1 = [0,1]
I get these lovely training curves
Admittedly this is clearly a dumbed down, very easy function to fit.
However as soon as I bring it over to MNIST, with this setup:
# Number of input, hidden and output nodes
# Input = 28 x 28 pixels
# Arbitrary number of hidden nodes, experiment to improve
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
# Learning rate
# Regularisation parameter
With this setup run on the code below, for 100 iterations, it does seem to train at first, then just "flatlines" quite quickly and doesn't achieve a very good model:
Initial ===== Cost (unregularised): 2.09203670985 /// Cost (regularised): 2.09203670985 Mean Gradient: 0.0321241229793
Iteration 100 Cost (unregularised): 0.980999805477 /// Cost (regularised): 0.980999805477 Mean Gradient: -5.29639499854e-09
TRAINED IN 26.45932364463806
This then gives really poor test accuracy and predicts the same output, even when tested with all inputs being 0.1 or all 0.9 I just get the same output (although precisely which number it outputs varies depending on initial random weights):
Test accuracy: 8.92
Targets 2 2 1 7 2 2 0 2 3
Hypothesis 5 5 5 5 5 5 5 5 5
And the curves for the MNIST Training:
Code dump:
# Import dependencies
import numpy as np
import time
import csv
import matplotlib.pyplot
import random
import math
# Read in training data
with open('MNIST/mnist_train_100.csv') as file:
train_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])
# In[197]:
# Plot a sample of training data to visualise
displayData(train_data[:,1:], 25)
# In[198]:
# Read in test data
with open('MNIST/mnist_test.csv') as file:
test_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()])
# Main neural network class
class neuralNetwork:
# Define the architecture
def __init__(self, i, h, o, lr, lda):
# Number of nodes in each layer
# Learning rate
# Lambda for regularisation
# Randomly initialise the parameters, input-> hidden and hidden-> output
def predict(self, X):
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node,self.ih.T) + 1
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node,self.ho.T) + 1
return outputs
def backprop (self, X, y):
m = X.shape[0]
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node,self.ih.T)
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node,self.ho.T)
# Compute error/ cost for this setup (unregularised and regularise)
# Output error term
# Hidden error term,self.ho)*sigmoidGradient(z2)
# Partial derivatives for weights,a2),X)
# Partial derivatives of theta with regularisation
# Update weights
# Hidden layer (weights 1)*(((D1)/m) + (self.lda/m)*self.ih)
# Output layer (weights 2)*(((D2)/m) + (self.lda/m)*self.ho)
# Unroll gradients to one long vector
return costReg, costUn, grad
def backpropIter (self, X, y):
m = X.shape[0]
# Add bias node x(0)=1 for all training examples, X is now m x n+1
# Then compute activation to hidden node,self.ih.T)
# Add bias node h(0)=1 for all training examples, H is now m x h+1
# Then compute activation to output node,self.ho.T)
# Compute error/ cost for this setup (unregularised and regularise)
for i in range(m):
delta3 = -(y[i,:]-h[i,:])*sigmoidGradient(z3[i,:])
delta2 =,delta3)*sigmoidGradient(z2[i,:])
gradW2= gradW2 + np.outer(delta3,a2[i,:])
gradW1 = gradW1 + np.outer(delta2,X[i,:])
# Update weights
# Hidden layer (weights 1)*(((gradW1)/m) + (self.lda/m)*self.ih)
# Output layer (weights 2)*(((gradW2)/m) + (self.lda/m)*self.ho)
# Unroll gradients to one long vector
return costUn, costReg, grad
def gradDesc(self, X, y):
# Backpropagate to get updates
# Unroll parameters
# m = no. training examples
#print (self.ih)
self.ih -= * ((deltaW1))#/m) + (self.lda * self.ih))
self.ho -= * ((deltaW2))#/m) + (self.lda * self.ho))
return cost,costreg,grad
# Gradient checking to compute the gradient numerically to debug backpropagation
def gradCheck(self, X, y):
# Unroll theta
# perturb will add and subtract epsilon, numgrad will store answers
# epsilon, e is a small number
e = 0.00001
# Loop over all theta
for i in range(len(theta)):
# Perturb is zeros with one index being e
loss1=self.costFuncGradientCheck(theta-perturb, X, y)
loss2=self.costFuncGradientCheck(theta+perturb, X, y)
# Compute numerical gradient and update vectors
return numgrad
def costFuncGradientCheck(self,theta,X,y):
# Compute activation to hidden node,T1.T)
# Compute activation to output node,T2.T)
cost=self.costFunc(h, y)
return cost #+ ((self.lda/2)*(np.sum(pow(T1,2)) + np.sum(pow(T2,2))))
def costFunc(self, h, y):
return np.sum(pow((h-y),2))/m
def costFuncReg(self, h, y):
cost=self.costFunc(h, y)
return cost #+ ((self.lda/2)*(np.sum(pow(self.ih,2)) + np.sum(pow(self.ho,2))))
# Helper functions to compute sigmoid and gradient for an input number or matrix
def sigmoid(Z):
return np.divide(1,np.add(1,np.exp(-Z)))
def sigmoidGradient(Z):
return sigmoid(Z)*(1-sigmoid(Z))
# Pre-processing helper functions
# Normalise data to 0.1-1 as 0 inputs kills the weights and changes
def scaleDataVec(data):
return (np.asfarray(data[1:]) / 255.0 * 0.99) + 0.1
def scaleData(data):
return (np.asfarray(data[:,1:]) / 255.0 * 0.99) + 0.1
# plot_data will be what to plot, num_ex must be a square number of how many examples to plot, random examples will then be plotted
def displayData(plot_data, num_ex, rand=1):
if rand==0:
# Useful variables, m= no. train ex, n= no. features
# Shape for one example
# No. of items to display
# Padding between images
# Setup blank display
display_array = -np.ones((pad + display_rows * (example_height + pad), (pad + display_cols * (example_width + pad))))
for i in range(1,display_rows+1):
for j in range(1,display_cols+1):
if curr_ex>m:
# Max value of this patch
max_val=max(abs(data[curr_ex, :]))
display_array[pad + (j-1) * (example_height + pad) : j*(example_height+1), pad + (i-1) * (example_width + pad) : i*(example_width+1)] = data[curr_ex, :].reshape(example_height, example_width)/max_val
matplotlib.pyplot.imshow(display_array, cmap='Greys', interpolation='None')
# In[312]:
for i in range(100):
# Debugging plot
# In[313]:
# Class instance
# Number of input, hidden and output nodes
# Input = 28 x 28 pixels
# Arbitrary number of hidden nodes, experiment to improve
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9]
# Learning rate
# Regularisation parameter
# Create instance of Nnet class
# In[314]:
# Scale inputs
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
for i in range(iterations):
j,jr,grad=nn.gradDesc(inputs, targets)
if i == 0:
print("Initial ===== Cost (unregularised): ", j, "\t///", "Cost (regularised): ",jr," Mean Gradient: ",grad)
print("\r", end="")
print("Iteration ", i+1, "\tCost (unregularised): ", j, "\t///", "Cost (regularised): ", jr," Mean Gradient: ",grad,end="")
time2 = time.time()
print ("\nTRAINED IN ",time2-time1)
# In[315]:
# Debugging plot
# In[316]:
# Scale inputs
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target
for i,line in enumerate(targets):
if line == h[i]:
print("Test accuracy: ", sum(score)/len(score)*100)
print("Targets ",end="")
for j in indexes:
print (targ[j]," ",end="")
print("\nHypothesis ",end="")
for j in indexes:
print (hyp[j]," ",end="")
displayData(test_data[indexes, 1:], 9, rand=0)
# In[277]:
Edit 1
Suggested to use different learning rates, but unfortunately, they all come out with similar results, here are the plots for 30 iterations, using the MNIST 100 subset:
Concretely, here are the figures that they start and end with:
Initial ===== Cost (unregularised): 4.07208963507 /// Cost (regularised): 4.07208963507 Mean Gradient: 0.0540251381858
Iteration 50 Cost (unregularised): 0.613310215166 /// Cost (regularised): 0.613310215166 Mean Gradient: -0.000133981500849
Initial ===== Cost (unregularised): 5.67535252616 /// Cost (regularised): 5.67535252616 Mean Gradient: 0.0644797515914
Iteration 50 Cost (unregularised): 0.381080434935 /// Cost (regularised): 0.381080434935 Mean Gradient: 0.000427866902699
Initial ===== Cost (unregularised): 3.54658422176 /// Cost (regularised): 3.54658422176 Mean Gradient: 0.0672211732868
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: 2.34515341943e-20
Initial ===== Cost (unregularised): 4.05269658215 /// Cost (regularised): 4.05269658215 Mean Gradient: 0.0469666696193
Iteration 50 Cost (unregularised): 0.980999999999 /// Cost (regularised): 0.980999999999 Mean Gradient: -1.0582706063e-14
Initial ===== Cost (unregularised): 2.40881492228 /// Cost (regularised): 2.40881492228 Mean Gradient: 0.0516056901574
Iteration 50 Cost (unregularised): 1.74539997258 /// Cost (regularised): 1.74539997258 Mean Gradient: 1.01955789614e-09
Initial ===== Cost (unregularised): 2.58498876008 /// Cost (regularised): 2.58498876008 Mean Gradient: 0.0388768685257
Iteration 3 Cost (unregularised): 1.72520399313 /// Cost (regularised): 1.72520399313 Mean Gradient: 0.0134040908157
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -4.49319474346e-43
Initial ===== Cost (unregularised): 4.40141352357 /// Cost (regularised): 4.40141352357 Mean Gradient: 0.0689167742968
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -1.01563966458e-22
A learning rate of 0.01, quite low, has the best outcome, but exploring learning rates in this region, I only came out with 30-40% accuracy, a big improvement on the 8% or even 0% that I had seen previously, but not really what it should be achieving!
Edit 2
I've now finished and added a backpropagation function optimized for matrices rather than the iterative formula, so now I can run it for large numbers of epochs/iterations without it being painfully slow. The "backprop" function of the class matches the gradient check (in fact it is 1/2 the size, but I think that is a problem in the gradient check, so we'll leave that because it should not matter proportionally, and I have tried adding in divisions to solve this). With large numbers of epochs I achieved a much better accuracy, but there still seems to be a problem: when I previously programmed a slightly different style of simple 3 layer neural network as part of a book, on the same dataset CSVs, I got a much better training result. Here are some plots and data for large epochs.
Looks good, but we still have a pretty poor test set accuracy, and this is for 2,500 runs through the dataset; it should be getting a good result with much less!
Test accuracy: 61.150000000000006
Targets 6 9 8 2 2 2 4 3 8
Hypothesis 6 9 8 4 7 1 4 3 8
Edit 3, what dataset?
I used train.csv and test.csv to try with more data, but it was no better and just takes longer, so I have been using the subsets train_100 and test_10 while I debug.
Edit 4
It seems to learn something after a very large number of epochs (like 14,000). As the whole dataset is used in the backprop function (not backpropIter), each loop is effectively an epoch, and with a ridiculous number of epochs on the subset of 100 train and 10 test samples, the test accuracy is quite good. However, with this small a sample this could easily be due to chance, and even then it's only 70%, not what you'd be aiming for even on the small dataset. But it does show that it seems to be learning; I am trying parameters very extensively to rule that out.
I solved my neural network. A brief description follows in case it helps anyone else. Thanks to all those that helped with suggestions.
Basically, I had implemented it with a fully matrix approach, i.e. the backpropagation uses all examples each time. I later tried implementing it as a vector approach, i.e. backpropagation with each example. This was when I realised that the matrix approach doesn't update the parameters after each example, so one run through this way is NOT the same as one run through each example in turn; effectively the whole training set is backpropagated as one example. Hence, my matrix implementation does work, but only after many iterations, which then ends up taking longer than the vector approach anyway! I have opened a new question to learn more about this specific part, but there we go: it just needed a lot of iterations with the matrix approach, or a more gradual example-by-example approach.
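The difference between the two approaches can be illustrated on a toy problem. The sketch below is plain Python with made-up data (fitting y = w * x with a true w of 2) and hypothetical function names; it is not the MNIST network, only a minimal picture of one update per epoch versus one update per example.

```python
def batch_epoch(w, xs, ys, lr):
    """One full-batch update: average the gradient over all examples, then step once."""
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad, 1  # (new weight, number of parameter updates)

def per_example_epoch(w, xs, ys, lr):
    """One pass over the data with a parameter update after every example."""
    for x, y in zip(xs, ys):
        w -= lr * 2 * (w * x - y) * x
    return w, len(xs)

# Toy problem (made up): fit y = w * x with true w = 2.
xs = [i / 10 for i in range(1, 11)]
ys = [2 * x for x in xs]

w_batch, updates_batch = batch_epoch(0.0, xs, ys, lr=0.1)
w_sgd, updates_sgd = per_example_epoch(0.0, xs, ys, lr=0.1)
print(updates_batch, abs(w_batch - 2))  # 1 update, larger remaining error
print(updates_sgd, abs(w_sgd - 2))      # 10 updates, smaller remaining error
```

With the same learning rate, one pass of the per-example version makes as many updates as there are examples, so the full-batch version needs many more passes to cover the same ground.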
You can "debug" your neural network using Tensorleap. It's a neural network debugging platform which uses some explainability algorithms. It allows you to upload your model and dataset and get a lot of information about your trained model.
MNIST is already in as a demo project.
As far as I know, they have a 14-day free trial.

